1 Introduction

In this paper, we consider the statistical analysis of high-dimensional
structured data in two closely related setups: vectors with small support
and matrices with low rank. In the first setup, known as Compressed Sensing
(CS) [201 [151 El El EH |9], the aim is to reconstruct a high-dimensional
vector with only a few non-zero coefficients from a small number of linear
measurements. In the second setup, called Matrix Completion
[TOl |23l El |26], the aim is to reconstruct a low-rank matrix from the
observation of only a few of its entries. Both problems are motivated by
practical applications in many domains (medical [22], imaging [12],
seismology [16], recommender systems such as the Netflix Prize, etc.) as
well as by theoretical challenges in several fields of mathematics (random
matrices, geometry of Banach spaces, harmonic analysis, empirical process
theory, etc.). From an algorithmic viewpoint, one central idea is the convex
relaxation of the $\ell_0$-functional (the function giving the number of
non-zero coefficients of a vector) and of the rank function. This idea gave
birth to two well-known algorithms: the Basis Pursuit algorithm [15] and
nuclear norm minimization [5]. Many results have been obtained for these two
algorithms, and we refer the reader to the next sections for more details.
Here we are interested in weighted versions of these algorithms, see [11] in
the CS setup. In particular, we seek a theoretical explanation of the
empirical observation that weighted Basis Pursuit outperforms classical
Basis Pursuit. We also propose a way to export the idea of reweighting to
the Matrix Completion problem.

2 Weighted basis-pursuit in Compressed Sensing

One way of setting the CS problem is to ask the following question. Starting
with an $m \times N$ matrix $A$, called a sensing or measurement matrix, and
with a vector $x$ in $\mathbb{R}^N$, is it possible to reconstruct $x$ from
the linear measurements $Ax$? Classical linear algebra tells us that we need
at least $m \geq N$ for the linear system to have a unique solution, and
hence to recover $x$ from $Ax$. But if more is known about $x$, then,
hopefully, a smaller number $m$ of measurements may be enough.

In the theory of CS, it is now well understood that it is indeed possible to
recover sparse signals (signals with a small support, the support being the
set of non-zero entries) from a small number of linear measurements. If $x$
is a sparse vector and $A$ a "good" measurement matrix (in a sense to be
clarified later), then looking for a vector $y$ with the smallest support
satisfying $Ay = Ax$ recovers $x$ exactly. This procedure, called the
$\ell_0$ or support minimization procedure, is known to be the best
theoretical procedure to recover any $s$-sparse vector $x$ (a vector with
support size at most $s$) from $Ax$, as long as $A$ is injective on the set
of all $s$-sparse vectors. However, this problem is NP-hard, and
alternatives are needed in practice, in part because the function
$x \mapsto |x|_0$ (where $|x|_0$ stands for the cardinality of the support
of $x$) is not convex.

A natural remedy to this problem is convex relaxation. In [15], the authors
propose to minimize the $\ell_1$-norm, as the convex envelope of this
non-convex function, leading to the so-called Basis Pursuit algorithm (BP).
The BP algorithm minimizes the $\ell_1$-norm on the affine space
$x + \ker A$. Namely, consider, for any $y \in \mathbb{R}^m$:

$$\Delta_1(y) \in \operatorname{argmin}_{t \in \mathbb{R}^N} \big( |t|_1 : At = y \big), \qquad (2.1)$$

so that $\Delta_1(Ax)$ is a candidate for the reconstruction of $x$ based on
$Ax$. We say that $x$ is exactly reconstructed by $\Delta_1$, namely
$\Delta_1(Ax) = x$, when $x$ is the unique solution of the minimization
problem (2.1) when $y = Ax$.
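To make (2.1) concrete, here is a minimal Python sketch that solves basis
pursuit as a linear program. It is only an illustration under stated
assumptions: the function name `basis_pursuit`, the problem sizes and the
use of SciPy's `linprog` are choices made here, not something prescribed by
the paper.

```python
import numpy as np
from scipy.optimize import linprog


def basis_pursuit(A, y):
    """Solve min |t|_1 subject to A t = y as a linear program.

    Writing t = u - v with u, v >= 0, the l1-norm of t equals
    sum(u) + sum(v) at the optimum, which gives a standard-form LP.
    """
    m, N = A.shape
    cost = np.ones(2 * N)                 # objective: sum(u) + sum(v)
    A_eq = np.hstack([A, -A])             # constraint: A u - A v = y
    res = linprog(cost, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:N] - res.x[N:]


# Illustrative sizes (hypothetical): Gaussian matrix with N(0, 1/m) entries
# and an s-sparse vector with Gaussian non-zero entries.
rng = np.random.default_rng(0)
m, N, s = 40, 80, 4
A = rng.normal(scale=1.0 / np.sqrt(m), size=(m, N))
x = np.zeros(N)
x[rng.choice(N, size=s, replace=False)] = rng.normal(size=s)

x_hat = basis_pursuit(A, A @ x)
rel_err = np.linalg.norm(x_hat - x) / np.linalg.norm(x)
```

The quantity `rel_err` is the relative error one would compare to a
tolerance in order to declare exact reconstruction.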

Note that other algorithms have been introduced in the CS literature. For
instance, $\ell_p$-minimization algorithms for $0 < p < 1$ are considered in
fiSl [2^ HHl [H]. Some greedy algorithms based on the ideas of the Matching
Pursuit algorithm of [121 ES] have been used in CS, see [2H1 EH SS] for
instance.

In the present paper, we consider weighted-$\ell_1$ minimization over
$x + \ker A$. This algorithm was introduced in [11]. Since then, it has
drawn particular attention because it is now acknowledged, although mainly
on empirical grounds, that a properly weighted basis-pursuit algorithm can
improve substantially upon basic basis-pursuit. This is illustrated in
Figure 1, and many other numerical experiments can be found in [11].
However, theoretical explanations of this fact are still lacking. Some
results in this direction are given in [311 |511 |32], [H], [31]. However,
the results given in these papers are of a different nature than ours, since
they rely on a random model for the unknown vector $x$, such as a vector
with i.i.d. $N(0, 1)$ non-zero entries whose support is uniformly
distributed conditionally on the sparsity. In the statement of our results,
$x$ is an arbitrary deterministic sparse vector. In [T8] an iteratively
reweighted least-squares procedure is studied, as an approximation of
basis-pursuit.

We introduce the weighted algorithm: for any $y \in \mathbb{R}^m$ and any
sequence $w = (w_1, \ldots, w_N) \in \mathbb{R}^N$ of non-negative weights,

$$\Delta_w(y) \in \operatorname{argmin}_{t \in \mathbb{R}^N} \Big( \sum_{i=1}^N \frac{|t_i|}{w_i} : At = y \Big). \qquad (2.2)$$

We use the convention $t/0 = \infty$ when $t > 0$ and $0/0 = 0$. Note that,
under this convention, the algorithm (2.2) is defined according to the
support $I_w$ of $w$ by

$$(\Delta_w(y))_{I_w^c} = 0 \quad \text{and} \quad (\Delta_w(y))_{I_w} \in \operatorname{argmin}_{t} \Big( \sum_{i \in I_w} \frac{|t_i|}{w_i} : A t_{I_w} = y \Big), \qquad (2.3)$$

where, if $t \in \mathbb{R}^N$ and $I \subset \{1, \ldots, N\}$, we denote
by $t_I$ the vector such that $(t_I)_i = t_i$ if $i \in I$ and
$(t_I)_i = 0$ if $i \notin I$. Once again, we say that $x$ is exactly
reconstructed by $\Delta_w$, namely $\Delta_w(Ax) = x$, when $x$ is the
unique solution of the minimization problem (2.2) when $y = Ax$. In
particular, this requires that the support of $x$ is included in the support
of $w$.

2.1 No-loss property

Note that when the weight vector $w$ is close to $|x|$, then
$\sum_{i=1}^N |x_i| / w_i$ is close to $|x|_0$. Moreover, for "reasonable"
matrices $A$, the vector $x$ is the one with the smallest support in the
affine space $x + \ker A$. So, a natural choice for $w$ in (2.2) is
$w = |\Delta_1(Ax)|$. We denote this decoder by $\Delta_2$:

$$\Delta_2(y) \in \operatorname{argmin}_{t \in \mathbb{R}^N} \Big( \sum_{i=1}^N \frac{|t_i|}{|\Delta_1(y)_i|} : At = y \Big). \qquad (2.4)$$

The next theorem proves that $\Delta_2$ is at least as good as the Basis
Pursuit algorithm $\Delta_1$.

Theorem 1. Let $x \in \mathbb{R}^N$. If $\Delta_1(Ax) = x$, then $\Delta_2(Ax) = x$.

The proof of Theorem 1 is based on the well-known null space property and
dual characterization of [6], see Section 4 below. However, it was observed
empirically in [11] that it is better to consider positive weights, and thus
to consider, for some $\varepsilon > 0$, the weights
$w_i = |\Delta_1(y)_i| + \varepsilon$ for $i = 1, \ldots, N$. This is easily
understood: if for some $i \in \{1, \ldots, N\}$, $\Delta_1(Ax)_i = 0$ while
$x_i \neq 0$, then $\Delta_2(Ax)_i$ is also equal to $0$, and there is no
hope of recovering $x$ using $\Delta_2$ either. By adding an extra
$\varepsilon$ term to each weight, the necessary support condition
$\mathrm{supp}(x) \subset \mathrm{supp}(w)$ to reconstruct $x$ from
$\Delta_w(Ax)$ is satisfied (see for instance Proposition 1 below). The
choice of $\varepsilon > 0$ can be made in a data-driven way, see [11].

2.2 An empirical evidence 

In Figure 1 we give a simple illustration of the fact that weighted
basis-pursuit can improve substantially upon basic basis-pursuit, using a
simple numerical experiment. For many combinations of $m$ ($y$-axis) and $s$
($x$-axis), we repeat the following experiment 50 times: draw at random a
sensing matrix $A$ with i.i.d. $N(0, 1/m)$ entries, and draw at random a
vector $x$ with $s$ non-zero coordinates chosen uniformly, with i.i.d.
$N(0, 1)$ non-zero entries. Then, compute $\hat x_1 = \Delta_1(Ax)$ and
$\hat x_w = \Delta^\varepsilon_{20}(Ax)$ (here we take $\varepsilon = 0.01$
without further investigation), where $\Delta^\varepsilon_k(Ax)$ is computed
iteratively, using

$$\Delta^\varepsilon_{k+1}(Ax) \in \operatorname{argmin}_{t \in \mathbb{R}^N} \Big( \sum_{i=1}^N \frac{|t_i|}{|\Delta^\varepsilon_k(Ax)_i| + \varepsilon} : At = Ax \Big). \qquad (2.5)$$

Then, we count the number of exact reconstructions achieved by $\hat x_1$
and $\hat x_w$ over the 50 repetitions. The plots on the left are the exact
recovery counts of $\hat x_1$ (black means exact recovery over the 50
repetitions), while the plots on the right are the exact recovery counts of
$\hat x_w$. In these figures, recovery is declared exact when
$|\hat x - x|_2 / |x|_2 \le \eta$, where we take $\eta$ = 10~^ on the first
line and $\eta$ = 10~^ on the second line. The red curve is a theoretical
"phase-transition" threshold $s \mapsto s \log(em/s)$. We observe in these
figures that $\hat x_w$ improves substantially upon $\hat x_1$, in
particular when $\eta$ = 10^^.
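The reweighting recursion (2.5) can be sketched in a few lines of Python.
This is a hedged illustration, not the code used for Figure 1: the helper
names `weighted_l1` and `reweighted_basis_pursuit`, the problem sizes, and
the choice of starting the recursion from plain basis pursuit are
assumptions made here.

```python
import numpy as np
from scipy.optimize import linprog


def weighted_l1(A, y, w):
    """Solve min sum_i |t_i| / w_i subject to A t = y, for weights w > 0."""
    m, N = A.shape
    cost = np.concatenate([1.0 / w, 1.0 / w])    # cost on the split t = u - v
    res = linprog(cost, A_eq=np.hstack([A, -A]), b_eq=y,
                  bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:N] - res.x[N:]


def reweighted_basis_pursuit(A, y, eps=0.01, n_iter=20):
    """Iterate (2.5); the start from unit weights is an assumption."""
    t = weighted_l1(A, y, np.ones(A.shape[1]))   # plain basis pursuit
    for _ in range(n_iter):
        t = weighted_l1(A, y, np.abs(t) + eps)   # weights |t_i| + eps
    return t


# One repetition of the experiment, with hypothetical sizes.
rng = np.random.default_rng(1)
m, N, s = 30, 60, 8
A = rng.normal(scale=1.0 / np.sqrt(m), size=(m, N))
x = np.zeros(N)
x[rng.choice(N, size=s, replace=False)] = rng.normal(size=s)
y = A @ x

x_w = reweighted_basis_pursuit(A, y)
rel_err_w = np.linalg.norm(x_w - x) / np.linalg.norm(x)
```

Because every weight is at least `eps`, the division `1.0 / w` is always
well defined, which is the practical reason for adding $\varepsilon$.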

2.3 A theoretical explanation 

Now, we want to understand if $\Delta_2$ can do better than $\Delta_1$, and
why. In particular, if $\Delta_1(Ax)$ is close to $x$ (but fails to
reconstruct $x$ exactly), under which condition do we get
$\Delta_2(Ax) = x$? In general, given a weight vector
$w \in \mathbb{R}^N$, what conditions on $w$ can ensure that
$\Delta_w(Ax) = x$? In Theorem 2 below, we use the duality argument of [6]
to prove that the condition

$$(A0)(I, C) \qquad |w_{I^c}|_\infty \, |(1/w)_I|_2 \le C, \qquad (2.6)$$

where $I$ is the support of $x$ and $C > 0$ is such that

where $\delta$ and $\mu$ are, respectively, the restricted isometry and
incoherency constants [SI El E] of the matrix $A$, ensures that the
$w$-weighted algorithm $\Delta_w$ recovers $x$ exactly given $Ax$.

Figure 1: Exact recovery counts (black means exact recovery) of basis-pursuit
(left column) and weighted basis-pursuit (right column), where the x-axis is
the sparsity ($s$) and the y-axis is the number of measurements ($m$). Exact
recovery is declared with a tolerance equal to 10~^ on the first line, and
equal to 10~^ on the second line. The red curve is a theoretical
phase-transition threshold $s \mapsto s \log(em/s)$.

It is interesting to note that, so far, only random matrices are known to
satisfy the incoherency and isometry properties for small values of $m$.
Thus, if one wants the number $m$ of measurements to be of the order (up to
some logarithmic factor) of the sparsity of the vector to recover, one has
to consider random matrices. This leads to results in Compressed Sensing
that hold with large probability, with respect to the randomness involved in
the construction of the sensing matrix. In practice, however, the most
interesting sensing matrices are structured matrices, like the Fourier or
the Walsh matrices (see [SIIIS]), since these matrices can be stored and
constructed by efficient algorithms. A lot of research goes in this
direction; we do not consider this problem here, but rather focus on
weighted algorithms. Therefore, we state our probabilistic results for a
simple (and somewhat universal) sensing matrix $A$ with i.i.d. centered
Gaussian entries with variance $1/m$.

Theorem 2. Let $x \in \mathbb{R}^N$ and denote by $I$ its support and by $s$
the cardinality of $I$. Let $C, \mu > 0$ and $0 < \delta < 1$. Assume that

$$m \ge c_0 \max\Big( \frac{s}{\delta^2}, \frac{s \log N}{\mu^2} \Big)$$

and C < ,

where $c_0$ is a purely numerical constant. Consider the event
$\Omega(I, C) = \{ |w_{I^c}|_\infty \, |(1/w)_I|_2 \le C \}$ and let $A$ be
an $m \times N$ matrix with entries being i.i.d. centered Gaussian random
variables with variance $1/m$. Then, with probability larger than

$$1 - 2\exp(-c_1 m \delta^2) - \exp(-c_2 \mu^2 m / s) - \mathbb{P}\big[\Omega(I, C)^c\big],$$

the vector $x$ is exactly reconstructed by $\Delta_w(Ax)$.



Theorem 2 gives an explicit condition, linking the incoherency constant
$\mu$, the restricted isometry constant $\delta$, and the constant $C$ from
condition (A0)(I, C) on the weights $w$, that ensures the exact
reconstruction of $x$ using $\Delta_w$. This is the first result of this
nature for weighted basis pursuit.

When $w_{I^c} = 0$, then (A0)(I, C) holds with $C = 0$, so that one can take
$\delta = 1$ and $\mu = +\infty$. This is the case for
$w = (|\Delta_1(Ax)_i|)_{i=1}^N$ when $\Delta_1(Ax) = x$. This condition is
also satisfied when the weight vector $w$ is close enough to $|x|$ and when
the absolute values of the non-zero coordinates of $x$ are sufficiently
large. For instance, (A0)(I, C) holds when

$$\min_{i \in I} |x_i| \ge \Big( 1 + \frac{\sqrt{s}}{C} \Big) \, \big| w - |x| \big|_\infty. \qquad (2.7)$$

Indeed, if we denote $\varepsilon = |w - |x||_\infty$, then (A0)(I, C)
follows from (2.7) since $\max_{i \in I^c} w_i \le \varepsilon$ and

$$\Big| \Big( \frac{1}{w} \Big)_I \Big|_2 \le \frac{\sqrt{s}}{\min_{i \in I} w_i} \le \frac{\sqrt{s}}{\min_{i \in I} |x_i| - \varepsilon}.$$

In particular, if (A0)(I, C) is satisfied with $C = c_1 / \sqrt{\log N}$,
for some constant $0 < c_1 < 1$, then a number of Gaussian measurements
proportional to $s$ will be enough to get $\Delta_w(Ax) = x$ with large
probability.
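The quantity appearing in (2.6) and the sufficient condition (2.7) are easy
to evaluate numerically. The following Python sketch is only an
illustration: the function names `a0_constant` and `sufficient_condition`
and the toy vectors are hypothetical, and the treatment of the case
$w_{I^c} = 0$ follows the remark above (the constant is taken equal to 0).

```python
import numpy as np


def a0_constant(w, support):
    """Return |w_{I^c}|_inf * |(1/w)_I|_2 for I = support, as in (2.6)."""
    w = np.asarray(w, dtype=float)
    on = np.zeros(w.size, dtype=bool)
    on[np.asarray(support, dtype=int)] = True
    sup_off = w[~on].max() if (~on).any() else 0.0
    if sup_off == 0.0:
        return 0.0                      # w vanishes outside I: C = 0 works
    if np.any(w[on] == 0.0):
        return np.inf                   # a zero weight on the support of x
    return sup_off * np.linalg.norm(1.0 / w[on])


def sufficient_condition(x, w, C):
    """Check (2.7): min_I |x_i| >= (1 + sqrt(s) / C) * |w - |x||_inf."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    support = np.flatnonzero(x)
    eps = np.max(np.abs(w - np.abs(x)))
    bound = (1.0 + np.sqrt(support.size) / C) * eps
    return bool(np.min(np.abs(x[support])) >= bound)


# Toy example: weights within 0.1 of |x| in sup norm.
x = np.array([5.0, -4.0, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
w = np.where(x != 0, np.abs(x) - 0.1, 0.1)
C = 0.5
constant = a0_constant(w, np.flatnonzero(x))
holds = sufficient_condition(x, w, C)
```

In this toy example (2.7) holds, and the computed constant is indeed below
`C`, in line with the implication stated above.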

In Figure 2 below, we give an empirical illustration of the fact that
(A0)(I, C) is indeed a relevant condition for exact reconstruction by
weighted basis-pursuit. We consider exactly the same experiment as in
Section 2.2, but this time we fix the number of measurements to $m = 110$
and the sparsity of $x$ to $s = 45$. For this combination of $m$ and $s$,
the phase transition occurs, namely basis pursuit can either work or not,
see Figure 1, so we can expect for these values a strong improvement of
weighted basis-pursuit over the non-weighted one. On the left-hand side of
Figure 2 we show the value of the constant $C$ over the reweighting
iterations. Namely, if $I$ is the support of the true unknown vector $x$, we
compute for $k = 1, \ldots, K$ the values of

$$C^{(k)} = |w^{(k)}_{I^c}|_\infty \, |(1/w^{(k)})_I|_2,$$

where

$$w^{(k)} = |\Delta^\varepsilon_k(Ax)| + \varepsilon,$$

over the 10 repetitions (differentiated by different colors), where we
recall that $\Delta^\varepsilon_k(Ax)$ is given by (2.5) and where we choose
$K = 30$. On the right-hand side of Figure 2 we show the logarithm of the
relative reconstruction errors over the iterations, namely

$$\mathrm{err}_k = \log \frac{|\Delta^\varepsilon_k(Ax) - x|_2}{|x|_2}$$

(we take the logarithm only for illustration purposes, so that we can see
the cases when exact reconstruction occurs). Each repetition of the
experiment is represented with a different color.

What we observe is a direct correspondence between the constant $C$ from
Assumption (A0)(I, C) and the quality of reconstruction of weighted basis
pursuit along the iterations. This suggests that Assumption (A0)(I, C)
indeed explains (at least in the considered configuration) when exact
reconstruction can or cannot happen using weighted basis pursuit.





Figure 2: Logarithm of the value of the constant $C$ from Assumption
(A0)(I, C) (left) and logarithm of the relative reconstruction error of
weighted basis pursuit over the iterations (right).



Remark 1. Note that uniform results can also be derived for the
weighted-$\ell_1$ algorithm. Indeed, by using classical machinery, it can be
proved that 1) implies 2) implies 3), where:

1. $A \operatorname{diag}(w)$ satisfies RIP($\delta$, $8s$) and, for all
   $x \in \Sigma_s$, $I_x \subset I_w$,

2. $\sup_{z \in \ker(A \operatorname{diag}(w)) \cap B_1^N} |z|_2 < \frac{1}{2\sqrt{s}}$
   and, for all $x \in \Sigma_s$, $I_x \subset I_w$,

3. for any $x \in \Sigma_s$, $\Delta_w(Ax) = x$.

But it is not clear why, for instance when $w = \Delta_1(Ax)$, it would be
easier for the matrix $A \operatorname{diag}(\Delta_1(Ax))$ to satisfy RIP
than for $A$ itself. The same remark also holds for the Euclidean section of
$B_1^N$ by the kernel of $A \operatorname{diag}(\Delta_1(Ax))$ or of $A$.
These approaches look too crude to perform a study of $\ell_1$-weighted
algorithms, where most of the gain can be obtained only on the absolute
multiplicative constant in front of the minimal number of measurements $m$
needed for exact reconstruction.

2.4 Verifying exact reconstruction 

Thanks to Theorem 1, it is easy to test whether we were able to reconstruct
exactly a vector $x$ given $Ax$. So far, we have to rely on the theory to
ensure that, with high probability, we have $\Delta_1(Ax) = x$. Using (2.4),
we can verify this belief. Indeed, Theorem 1 entails that
$\Delta_2(Ax) = x$ when $\Delta_1(Ax) = x$. In particular, if
$\Delta_1(Ax) \neq \Delta_2(Ax)$, then we are sure that $\Delta_1(Ax)$ did
not reconstruct $x$ exactly. Then, we can iterate the mechanism and define
for any $k \ge 1$

$$\Delta_{k+1}(Ax) \in \operatorname{argmin}_{t \in \mathbb{R}^N} \Big( \sum_{i=1}^N \frac{|t_i|}{|\Delta_k(Ax)_i|} : At = Ax \Big),$$

leading to a sequence

$$\Delta_1(Ax), \Delta_2(Ax), \ldots, \Delta_r(Ax). \qquad (2.8)$$

If the sequence (2.8) does not become constant after a certain number of
iterations, then it is very likely that none of the algorithms
$\Delta_k(Ax)$ reconstructed $x$ exactly. We also have the following
converse statement. Denote by $\Sigma_k$ the set of all $k$-sparse vectors
in $\mathbb{R}^N$.

Theorem 3. Let $A$ be an $m \times N$ matrix that is injective on
$\Sigma_m$ and let $x \in \Sigma_{\lfloor m/2 \rfloor}$. The following
statements are equivalent:

1. There exists an integer $r$ such that $\Delta_r(Ax) = x$,

2. The sequence $\Delta_1(Ax), \Delta_2(Ax), \ldots$ becomes constantly
   equal to a $\lfloor m/2 \rfloor$-sparse vector after a certain number of
   iterations.

Note that a matrix with i.i.d. standard Gaussian entries is injective on
$\Sigma_m$ with probability one. Thus, we propose to compute the sequence
(2.8) as an empirical test for the exact reconstruction of a vector $x$ from
$Ax$.




3 Iteratively weighted soft-thresholding for 
matrix completion 

In many applications, data can be represented as a database with missing
entries. The problem is then to fill in the missing values of the database,
leading to the so-called matrix completion problem. For instance,
collaborative filtering aims at making automatic predictions of the tastes
of users, using the collected tastes of all users at the same time [25]. The
Netflix Prize is a popular application of this problem. Other applications
include machine learning [i\, control [37], quantum state tomography [27],
structure from motion [18], among many others. This problem can be
understood as a non-commutative extension of the compressed sensing problem.
So, a natural question is the following: does the principle of iterative
weighting of the $\ell_1$-norm also work for matrix completion? In this
section, we show empirically that the answer to this question is yes. We
show that one can improve the convex relaxation principle for matrices,
which is based on the nuclear norm [10], [26], by using a weighted nuclear
norm, in the same way as we did for vectors in Section 2. However, note that
there is, as explained below, a major difference between the vector and
matrix cases at this point, since a weighted nuclear norm is not convex in
general, while a weighted $\ell_1$-norm is.

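Let us first recall standard definitions and notations. Let
$A_0 \in \mathbb{R}^{n_1 \times n_2}$ be a matrix with $n_1$ rows and $n_2$
columns. The matrix $A_0$ is not fully observed. What we observe is a given
subset $\Omega \subset \{1, \ldots, n_1\} \times \{1, \ldots, n_2\}$ of
cardinality $m$ of the entries of $A_0$, where $m \ll n_1 n_2$. For any
matrix $A \in \mathbb{R}^{n_1 \times n_2}$, we define the masking operator
$\mathcal{P}_\Omega(A) \in \mathbb{R}^{n_1 \times n_2}$ such that
$(\mathcal{P}_\Omega(A))_{j,k} = A_{j,k}$ when $(j, k) \in \Omega$ and
$(\mathcal{P}_\Omega(A))_{j,k} = 0$ when $(j, k) \notin \Omega$. We also
define $\mathcal{P}_\Omega^\perp(A) = A - \mathcal{P}_\Omega(A)$.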
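As a small illustration of these definitions, the masking operator and its
complement can be written with a boolean mask in NumPy. The names
`mask_operator` and `mask_complement`, and the representation of $\Omega$ as
a boolean array, are assumptions of this sketch.

```python
import numpy as np


def mask_operator(A, mask):
    """P_Omega(A): keep the entries where mask is True, zero elsewhere."""
    return np.where(mask, A, 0.0)


def mask_complement(A, mask):
    """P_Omega^perp(A) = A - P_Omega(A)."""
    return A - mask_operator(A, mask)


# Toy example: Omega is drawn at random, about 30% of the entries observed.
rng = np.random.default_rng(0)
A0 = rng.normal(size=(6, 5))
mask = rng.random(A0.shape) < 0.3
observed = mask_operator(A0, mask)
hidden = mask_complement(A0, mask)
```

Storing $\Omega$ as a boolean array is convenient for small examples; for
large matrices a sparse representation of the observed entries would be
more natural.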

Since we consider the case where $m \ll n_1 n_2$, the matrix completion
problem is in general severely ill-posed. So, one needs to impose a
complexity or sparsity assumption on the unknown matrix $A_0$. This is done
by assuming that $A_0$ has low rank, which is the natural extension of the
sparsity assumption for vectors to the spectrum of a matrix. For the problem
of exact reconstruction, other geometrical assumptions are necessary (such
as the incoherency assumption, see [3 UHl EI]). Under such assumptions, it
is now well understood that the principle of convex relaxation of the rank
function is able to reconstruct exactly the unknown matrix from few
measurements, see |5l Uni ESI Us]. Indeed, a natural approach would be to
solve the problem

$$\text{minimize } \operatorname{rank}(A) \quad \text{subject to } \mathcal{P}_\Omega(A) = \mathcal{P}_\Omega(A_0), \qquad (3.1)$$

but this minimization problem is known to be very hard to solve in practice
even for small matrices, see for instance ^ HU]. The convex envelope of the
rank function over the unit ball of the operator norm is the nuclear norm,
see [23], which is given by

$$\|A\|_1 = \sum_{j=1}^{n_1 \wedge n_2} \sigma_j(A)$$

(it is the bi-conjugate of the rank function over the unit ball of the
operator norm), where
$\sigma_1(A) \ge \cdots \ge \sigma_{n_1 \wedge n_2}(A)$ are the singular
values of $A$ in decreasing order. So, the convex relaxation of (3.1) is

$$\text{minimize } \|A\|_1 \quad \text{subject to } \mathcal{P}_\Omega(A) = \mathcal{P}_\Omega(A_0). \qquad (3.2)$$

This problem has received a lot of attention recently, see [51 [101 ESI
l30l H3], among many others. The point is that, in the same way as basis
pursuit for vectors, (3.2) is able to recover $A_0$ exactly with large
probability, based on an almost minimal number of samples (under some
geometrical assumption).
In the literature concerned with computational problems [31], [3S],
[13 E3], among others, a relaxed version of (3.2) is considered, since it is
easier to construct a solver for it (one can apply generic optimal
first-order methods, such as proximal forward-backward splitting [T7], among
many other methods) and since it is more stable in the presence of noise.
Note that the SVT algorithm of ^ gives a solution under equality constraints
for an objective function with an extra ridge term
$\|A\|_1 + \tau \|A\|_2^2$. The relaxed problem is simply formulated as
penalized least-squares:

$$\hat A_\lambda \in \operatorname{argmin}_{A \in \mathbb{R}^{n_1 \times n_2}} \Big\{ \frac{1}{2} \|\mathcal{P}_\Omega(A) - \mathcal{P}_\Omega(A_0)\|_2^2 + \lambda \|A\|_1 \Big\}, \qquad (3.3)$$

where $\lambda > 0$ is a parameter balancing goodness-of-fit and complexity,
measured by the nuclear norm.

Before we go on, we need some notation. The vector of singular values of $A$
is denoted by $\sigma(A) = (\sigma_1(A), \ldots, \sigma_r(A))$, sorted in
non-increasing order, where $r$ is the rank of $A$. We define, for
$p \ge 1$, the $p$-Schatten norm by

$$\|A\|_p = |\sigma(A)|_p,$$

which is the $\ell_p$ norm of $\sigma(A)$. We also denote by
$\|A\| = \|A\|_\infty = \sigma_1(A)$ the operator norm of $A$, and note that
$\|A\|_2$ is the Frobenius norm, associated with the Euclidean inner product
$\langle A, B \rangle = \operatorname{tr}(A^\top B)$, where
$\operatorname{tr}(A)$ stands for the trace of $A$. For any matrix $A$, its
singular value decomposition (SVD) writes as
$A = U \operatorname{diag}(\sigma(A)) V^\top$, where
$\operatorname{diag}(\sigma(A))$ is the diagonal matrix with $\sigma(A)$ on
its diagonal, and $U$ and $V$ are, respectively, $n_1 \times r$ and
$n_2 \times r$ matrices with orthonormal columns.
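The Schatten norms defined above can be computed directly from the singular
values. The short Python sketch below (the function name `schatten_norm` is
hypothetical) checks the three special cases mentioned in the text against
NumPy's built-in matrix norms.

```python
import numpy as np


def schatten_norm(A, p):
    """p-Schatten norm: the l_p norm of the vector of singular values."""
    sigma = np.linalg.svd(A, compute_uv=False)
    if np.isinf(p):
        return sigma.max()                      # operator norm
    return (sigma ** p).sum() ** (1.0 / p)


rng = np.random.default_rng(0)
A = rng.normal(size=(5, 4))
nuclear = schatten_norm(A, 1)          # ||A||_1
frobenius = schatten_norm(A, 2)        # ||A||_2
operator = schatten_norm(A, np.inf)    # ||A|| = ||A||_inf
```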

3.1 A new algorithm for matrix completion 

We aim to do the same as in Section 2 for the reconstruction of sparse
vectors. For a given weight vector
$w = (w_1, \ldots, w_{n_1 \wedge n_2})$, with
$w_1 \ge \cdots \ge w_{n_1 \wedge n_2} \ge 0$, we consider

$$\hat A^w_\lambda \in \operatorname{argmin}_{A \in \mathbb{R}^{n_1 \times n_2}} \Big\{ \frac{1}{2} \|\mathcal{P}_\Omega(A) - \mathcal{P}_\Omega(A_0)\|_2^2 + \lambda \|A\|_{1,w} \Big\}, \qquad (3.4)$$

where $\|A\|_{1,w}$ is the weighted nuclear norm

$$\|A\|_{1,w} = \sum_{j=1}^{n_1 \wedge n_2} \frac{\sigma_j(A)}{w_j}, \qquad (3.5)$$

with the convention $1/0 = +\infty$. Now, we would like to use the idea of
reweighting using previous estimates, in the same way as we did in
Section 2: if $\hat A_\lambda$ is a solution to (3.3), we want to use for
instance

$$w_j = \sigma_j(\hat A_\lambda),$$

and find a solution to the problem (3.4) for this choice of weights. But let
us stress the fact that, while we call $\|\cdot\|_{1,w}$ the weighted
nuclear norm, it is not a norm, since it is not a convex function in
general! A simple counter-example is as follows. If $w_1 > w_2$ (which is
usually the case since singular values are taken in non-increasing order)
then for $A = \operatorname{diag}(1, 0, \ldots, 0)$ and
$B = \operatorname{diag}(0, 1, 0, \ldots, 0)$, we have

$$\frac{\|A\|_{1,w} + \|B\|_{1,w}}{2} = \frac{2}{2 w_1} = \frac{1}{w_1} < \frac{1}{2} \Big( \frac{1}{w_1} + \frac{1}{w_2} \Big) = \Big\| \frac{A + B}{2} \Big\|_{1,w},$$

hence $\|\cdot\|_{1,w}$ is not convex. Moreover, since the aim of
$\|\cdot\|_{1,w}$ is to promote low-rank matrices, the weight vector $w$
should be chosen non-increasing, corresponding precisely to the case where
$\|\cdot\|_{1,w}$ is non-convex (note that when
$0 < w_1 \le w_2 \le \cdots \le w_{n_1 \wedge n_2}$, it is easy to prove
that $\|\cdot\|_{1,w}$ is a norm).
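The counter-example can be checked numerically. The sketch below is an
illustration only: the function name `weighted_nuclear` and the specific
weights are hypothetical, and the weights are assumed strictly positive so
that the convention $1/0 = +\infty$ is not needed.

```python
import numpy as np


def weighted_nuclear(A, w):
    """Weighted nuclear 'norm' (3.5), assuming strictly positive weights."""
    sigma = np.linalg.svd(A, compute_uv=False)   # non-increasing order
    return float(np.sum(sigma / np.asarray(w, dtype=float)))


A = np.diag([1.0, 0.0, 0.0])
B = np.diag([0.0, 1.0, 0.0])

# Non-increasing weights (w_1 > w_2): midpoint convexity fails.
w_dec = [2.0, 1.0, 1.0]
avg_dec = 0.5 * (weighted_nuclear(A, w_dec) + weighted_nuclear(B, w_dec))
mid_dec = weighted_nuclear(0.5 * (A + B), w_dec)

# Non-decreasing weights: the inequality goes the convex way here.
w_inc = [1.0, 2.0, 2.0]
avg_inc = 0.5 * (weighted_nuclear(A, w_inc) + weighted_nuclear(B, w_inc))
mid_inc = weighted_nuclear(0.5 * (A + B), w_inc)
```

With non-increasing weights the value at the midpoint exceeds the average
of the values at the endpoints, which is exactly the failure of convexity
described above.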



Consequently, (3.4) is not a convex minimization problem in general, and a
minimization algorithm is very likely to get stuck at a local minimum. But
we would like to stick to the idea of reweighting, since it worked well for
CS. The first idea that may come to mind is to use a convex relaxation of
the non-convex function $\|\cdot\|_{1,w}$ (just as convex relaxation of the
rank function led to the nuclear norm), but it simply leads back to the
nuclear norm itself! Indeed, it can be proved that if
$w_1 \ge w_2 \ge \cdots \ge w_{n_1 \wedge n_2} > 0$, the convex envelope of
$\|\cdot\|_{1,w}$ on the ball $\{A : \|A\|_1 \le 1\}$ is simply
$A \mapsto \|A\|_1 / w_1$.



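Let us go back to the original problem (3.3). It turns out that (3.3) is
equivalent to the fact that $\hat A_\lambda$ satisfies the following
fixed-point equation:

$$\hat A_\lambda = S_\lambda\big( \mathcal{P}_\Omega^\perp(\hat A_\lambda) + \mathcal{P}_\Omega(A_0) \big), \qquad (3.6)$$

where $S_\lambda$ is the spectral soft-thresholding operator defined for
every $B \in \mathbb{R}^{n_1 \times n_2}$ by

$$S_\lambda(B) = U_B \operatorname{diag}\big( (\sigma_1(B) - \lambda)_+, \ldots, (\sigma_{\operatorname{rank}(B)}(B) - \lambda)_+ \big) V_B^\top,$$

where $B = U_B \Sigma_B V_B^\top$ is the SVD of $B$, with
$\Sigma_B = \operatorname{diag}(\sigma_1(B), \ldots, \sigma_{\operatorname{rank}(B)}(B))$.
This fact is easily explained. Indeed, define
$f_2(A) = \frac{1}{2} \|\mathcal{P}_\Omega(A) - \mathcal{P}_\Omega(A_0)\|_2^2$,
which is a differentiable function with gradient
$\nabla f_2(A) = \mathcal{P}_\Omega(A) - \mathcal{P}_\Omega(A_0)$, and
$f_1(A) = \lambda \|A\|_1$, which is a non-differentiable convex function.
We denote by $\partial f_1(A)$ the subdifferential of $f_1$ at $A$. The fact
that $\hat A_\lambda \in \operatorname{argmin}_A \{ f_2(A) + f_1(A) \}$ is
equivalent to
$0 \in \partial (f_1 + f_2)(\hat A_\lambda) = \{ \nabla f_2(\hat A_\lambda) \} + \partial f_1(\hat A_\lambda)$
(for the Minkowski addition of sets), which we rewrite in the following way:

$$\hat A_\lambda - \nabla f_2(\hat A_\lambda) - \hat A_\lambda \in \partial f_1(\hat A_\lambda). \qquad (3.7)$$

On the other hand, a standard tool in convex analysis is the proximal
operator, [TTj, [S]. The proximal operator of a convex function, for
instance $f_1$, is given, for every $B \in \mathbb{R}^{n_1 \times n_2}$, by

$$\operatorname{prox}_{f_1}(B) = \operatorname{argmin}_{A \in \mathbb{R}^{n_1 \times n_2}} \Big\{ \frac{1}{2} \|A - B\|_2^2 + f_1(A) \Big\},$$

the minimizer being unique since
$A \mapsto \frac{1}{2} \|A - B\|_2^2 + f_1(A)$ is strongly convex. But,
since
$\partial \big( \frac{1}{2} \|\cdot - B\|_2^2 + f_1(\cdot) \big)(A) = \{A - B\} + \partial f_1(A)$,
the point $\operatorname{prox}_{f_1}(B)$ is uniquely determined by the
inclusion

$$B - \operatorname{prox}_{f_1}(B) \in \partial f_1\big( \operatorname{prox}_{f_1}(B) \big). \qquad (3.8)$$

So, choosing $B = \hat A_\lambda - \nabla f_2(\hat A_\lambda)$ in (3.8) and
identifying with (3.7) leads to the fact that $\hat A_\lambda$ satisfies the
fixed-point equation

$$\hat A_\lambda = \operatorname{prox}_{f_1}\big( \hat A_\lambda - \nabla f_2(\hat A_\lambda) \big),$$

which leads to (3.6) in this particular case, since we know that
$\operatorname{prox}_{f_1}(B) = S_\lambda(B)$ (see Proposition 2 below).
Note that the same argument proves that, if we add a ridge term to the
nuclear norm penalization, namely

$$\hat A_{\lambda,\tau} = \operatorname{argmin}_{A \in \mathbb{R}^{n_1 \times n_2}} \Big\{ \|\mathcal{P}_\Omega(A) - \mathcal{P}_\Omega(A_0)\|_2^2 + 2\lambda \|A\|_1 + \tau \|A\|_2^2 \Big\} \qquad (3.9)$$

for any $\tau > 0$, then an equivalent formulation is the fixed-point
equation

$$\hat A_{\lambda,\tau} = \frac{1}{1 + \tau} S_\lambda\big( \mathcal{P}_\Omega^\perp(\hat A_{\lambda,\tau}) + \mathcal{P}_\Omega(A_0) \big), \qquad (3.10)$$

and the minimizer is unique this time, since the objective function is now
strongly convex.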
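The identity $\operatorname{prox}_{f_1}(B) = S_\lambda(B)$ used above can be
probed numerically. The following Python sketch (with hypothetical helper
names `soft_threshold` and `prox_objective`) implements $S_\lambda$ through
an SVD and compares the value of the proximal objective at $S_\lambda(B)$
with its value at randomly perturbed points. It is a sanity check under
these assumptions, not a proof.

```python
import numpy as np


def soft_threshold(B, lam):
    """Spectral soft-thresholding S_lam(B): shrink singular values by lam."""
    U, sigma, Vt = np.linalg.svd(B, full_matrices=False)
    return (U * np.maximum(sigma - lam, 0.0)) @ Vt


def prox_objective(A, B, lam):
    """0.5 * ||A - B||_2^2 + lam * ||A||_1 (Frobenius and nuclear norms)."""
    return (0.5 * np.linalg.norm(A - B, "fro") ** 2
            + lam * np.linalg.norm(A, "nuc"))


rng = np.random.default_rng(0)
B = rng.normal(size=(6, 4))
lam = 0.8
S = soft_threshold(B, lam)

# Objective at S versus at random perturbations of S.
base = prox_objective(S, B, lam)
perturbed = [prox_objective(S + 0.1 * rng.normal(size=B.shape), B, lam)
             for _ in range(20)]
```

Since the proximal objective is strongly convex with minimizer
$S_\lambda(B)$, every entry of `perturbed` should be at least `base`.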

The argument given above is at the core of proximal operator theory, and
leads to the so-called proximal forward-backward splitting algorithms, see
[T71 HO] and [3]. Since these algorithms are optimal among the class of
first-order algorithms, they have drawn considerable attention in the
machine learning community, see for instance the survey [2]. Another
advantage in the case of matrix completion is that such an algorithm can
handle large-scale matrices, see Remark 2 below.

So, we have seen that (3.3) and (3.6), or (3.9) and (3.10), are equivalent
formulations of the same problem. So, instead of considering (3.4), we could
consider the corresponding fixed-point problem. Unfortunately, since
$\|\cdot\|_{1,w}$ is non-convex, the above arguments based on the
subdifferential do not make sense anymore. But still, we can consider an
estimator defined by a fixed-point equation for the weighted
soft-thresholding operator.

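Theorem 4. Assume that $\tau > 0$ and
$w_1 \ge \cdots \ge w_{n_1 \wedge n_2} \ge 0$. Let us define the matrix
$\hat A^w_\lambda$ as the solution of the fixed-point equation

$$\hat A^w_\lambda = \frac{1}{1 + \tau} S^w_\lambda\big( \mathcal{P}_\Omega^\perp(\hat A^w_\lambda) + \mathcal{P}_\Omega(A_0) \big), \qquad (3.11)$$

where $S^w_\lambda$ is the weighted soft-thresholding operator given by

$$S^w_\lambda(B) = U_B \operatorname{diag}\Big( \Big( \sigma_1(B) - \frac{\lambda}{w_1} \Big)_+, \ldots, \Big( \sigma_{\operatorname{rank}(B)}(B) - \frac{\lambda}{w_{\operatorname{rank}(B)}} \Big)_+ \Big) V_B^\top, \qquad (3.12)$$

where $B = U_B \operatorname{diag}(\sigma(B)) V_B^\top$ is the SVD of $B$.
Then, the solution to (3.11) exists and is unique.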



Theorem 4 is proved in Section 4 below, and is a by-product of our analysis
of the iterative scheme to approximate the solution of (3.11). The parameter
$\tau > 0$ can be arbitrarily small (in our numerical experiments we take it
equal to zero, see Section 3.2), but it ensures uniqueness and convergence
of the iterative scheme proposed below. Once again, let us stress the fact
that (3.11) (with $\tau = 0$) is not equivalent to (3.4) in general, since
$A \mapsto \|A\|_{1,w}$ is not convex.

Considering (3.11) has several advantages: we guarantee uniqueness of the
solution, while the problem (3.4) may have several solutions, and it is easy
to solve the fixed-point problem (3.11) using iterations. Furthermore, from
a numerical point of view, it can easily be used together with a
continuation algorithm, as explained in Section 3.2 below, to compute a set
of solutions for several values of the smoothing parameter $\lambda$.

The next theorem proves that the iterates of the fixed-point equation (3.11)
converge exponentially fast to the solution.

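Theorem 5. Take $A^0$ as the matrix with zero entries and define for any
$k \ge 0$:

$$A^{k+1} = \frac{1}{1 + \tau} S^w_\lambda\big( \mathcal{P}_\Omega^\perp(A^k) + \mathcal{P}_\Omega(A_0) \big). \qquad (3.13)$$

Then, for any $n \ge 1$, one has:

$$\|\hat A^w_\lambda - A^n\|_2 \le \frac{(1 + \tau)^{-n}}{\tau} \|\mathcal{P}_\Omega(A_0)\|_2,$$

where $\hat A^w_\lambda$ is the solution of (3.11).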
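The recursion (3.13) is straightforward to run. The Python sketch below is
an illustration under stated assumptions: the helper names
(`weighted_soft_threshold`, `wsst_iterates`, `fixed_point_residual`), the
matrix sizes and the values of $\lambda$ and $\tau$ are hypothetical. The
residual is checked only for constant weights, where $S^w_\lambda$ reduces
to the proximal operator $S_\lambda$ and the map is a contraction; for
non-increasing weights the residual is simply computed and reported.

```python
import numpy as np


def weighted_soft_threshold(B, lam, w):
    """S^w_lam(B) from (3.12): shrink the j-th singular value by lam / w_j."""
    U, sigma, Vt = np.linalg.svd(B, full_matrices=False)
    w = np.asarray(w, dtype=float)
    thresh = np.full(sigma.shape, np.inf)       # convention 1/0 = +inf
    thresh[w > 0] = lam / w[w > 0]
    return (U * np.maximum(sigma - thresh, 0.0)) @ Vt


def wsst_iterates(PA0, mask, lam, w, tau, n_iter):
    """Run (3.13) from the zero matrix; PA0 holds P_Omega(A0)."""
    A = np.zeros_like(PA0)
    for _ in range(n_iter):
        B = np.where(mask, PA0, A)              # P_perp(A) + P_Omega(A0)
        A = weighted_soft_threshold(B, lam, w) / (1.0 + tau)
    return A


def fixed_point_residual(A, PA0, mask, lam, w, tau):
    """Frobenius distance between A and the right-hand side of (3.11)."""
    B = np.where(mask, PA0, A)
    return np.linalg.norm(A - weighted_soft_threshold(B, lam, w) / (1.0 + tau))


rng = np.random.default_rng(0)
A0 = rng.normal(size=(12, 2)) @ rng.normal(size=(2, 10))   # rank-2 matrix
mask = rng.random(A0.shape) < 0.5
PA0 = np.where(mask, A0, 0.0)

# Constant weights: plain soft-thresholding, a contraction for tau > 0.
w_flat = np.ones(10)
A_flat = wsst_iterates(PA0, mask, lam=0.1, w=w_flat, tau=0.5, n_iter=200)
res_flat = fixed_point_residual(A_flat, PA0, mask, 0.1, w_flat, 0.5)

# Non-increasing weights taken from the singular values of P_Omega(A0).
w_dec = np.linalg.svd(PA0, compute_uv=False)
A_dec = wsst_iterates(PA0, mask, lam=0.1 * w_dec[0], w=w_dec, tau=0.5, n_iter=200)
res_dec = fixed_point_residual(A_dec, PA0, mask, 0.1 * w_dec[0], w_dec, 0.5)
```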



The proof of Theorem 5 is given in Section 4.2. The main step of the proof
is to establish the Lipschitz property of the weighted soft-thresholding
operator, see Proposition 3. Since $S^w_\lambda$ is not a proximal operator
(the objective function is not convex), we cannot directly use the property
of firm non-expansiveness, which is a direct consequence of the definition
of a proximal operator, see the discussion in Section 4.2.



3.2 Numerical study 

3.2.1 Algorithms 

In this section we compare empirically the quality of reconstruction using
nuclear norm minimization (3.3) (NNM), or equivalently (3.6), and weighted
spectral soft-thresholding (3.11) (WSST). To compute the NNM we use the
Accelerated Proximal Gradient (APG) algorithm of [47], using the MATLAB
package NNLS, which is a state-of-the-art solver for the minimization
problem (3.3). This algorithm is based on an accelerated proximal gradient
algorithm, itself based on the accelerated gradient of Nesterov, see
|10l SI] and the FISTA algorithm, see [3], and see also [29] for a similar
algorithm. In the APG algorithm, we use the linesearch and the continuation
techniques, see [TTj, but we do not use truncation, since it led to poor
results in the problems considered here. The target value of $\lambda$ for
NNM and WSST (see (3.3) and (3.11)) is simply taken as
$\lambda_{\mathrm{target}} = \varepsilon \times \|\mathcal{P}_\Omega(A_0)\|_\infty$,
with $\varepsilon$ = 10~^ or $\varepsilon$ = 10~^ depending on the problem,
see below. The solution coming out of the APG algorithm is denoted by
$\hat A^{(0)}$. Note that we could have used the FPC [31] or SVT [^
algorithms instead, but this led in our experiments to poorer results
compared to the APG (in particular when looking for solutions with a rank of
order, say, 100 on "real" matrices, as in the inpainting or recommender
systems problems, see below).

The WSST is computed following Algorithm 1 below. The first while loop is a
continuation loop, which goes progressively to
$\lambda_{\mathrm{target}}$. Doing this instead of using
$\lambda_{\mathrm{target}}$ directly is known to improve the stability and
rate of convergence of the algorithm. It does not take more time than using
$\lambda_{\mathrm{target}}$ directly (actually, it usually takes less time),
since we use warm starts: when taking a smaller $\lambda$, we use the
previous value $A_{\mathrm{new}}$ (the solution with the previous
$\lambda$) as a starting point. Once we have reached
$\lambda_{\mathrm{target}}$, we obtain a first solution of the fixed-point
problem (3.11), denoted by $\hat A^{(1)}$. Then, we update the weights by
taking $w_j = \sigma_j(\hat A^{(1)})$, and we start all over. We do not use
a continuation loop again, since we are already at the desired value of
$\lambda$. We keep the parameter $\lambda$ fixed, and only repeat the
process of updating the weights and finding the solution to the fixed point
(3.11) $K$ times. By doing this, we typically decrease (sometimes
substantially) the final rank of the WSST, while keeping a good
reconstruction accuracy. This process of updating the weights is usually not
long. Typically, after a small number of iterations, two fixed-point
solutions before and after an update are very close, so that our choice
$K = 50$ is typically too large, but we keep it this way to ensure a good
stability of the final solution.



Note that in Algorithm 1 we use the iterations (3.13) with $\tau = 0$, since
it gives satisfactory results. We use a simple stopping rule
$\|A_{\mathrm{new}} - A_{\mathrm{old}}\|_2 / \|A_{\mathrm{old}}\|_2 < \mathrm{tol}$,
with tol set to either 5 times a negative power of 10 or a negative power of
10, depending on the scaling of the problem, see below. We used in all our
computations $q = 0.7$ and $K = 50$. For a fair comparison, we always use,
for a given reconstruction problem, the same parameters $\varepsilon$, tol
and $\lambda$ for both NNM and WSST. Of course, for the WSST we need to
rescale $\lambda$ by multiplying it by $w_1$ (the first coordinate of the
weights vector, which is equal to $\sigma_1(\hat{A}^{(0)})$ at the first
iteration).
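
The procedure described above (continuation with warm starts, then reweighting at a fixed $\lambda$) can be sketched in NumPy as follows. This is only an illustrative sketch under stated assumptions, not the implementation used for the experiments: the function names, the `max_iter` safeguard, the floor on the weights, the clamping of $\lambda$ at its target value, the small problem size and the use of the zero-filled observations as preliminary reconstruction (instead of the APG solution) are all choices made here for the example.

```python
import numpy as np

def weighted_sst(B, w, lam):
    """Shrink the j-th singular value of B by lam * w[0] / w[j] (lam rescaled by w_1)."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    s_shrunk = np.maximum(s - lam * w[0] / w, 0.0)
    return (U * s_shrunk) @ Vt

def solve_fixed_point(A_start, obs, mask, w, lam, tol, max_iter=200):
    """Iterate A <- S(A - P(A) + P(A0)), i.e. (3.13) with tau = 0, from a warm start."""
    A_new = A_start
    for _ in range(max_iter):  # max_iter is a safeguard added for this sketch
        A_old = A_new
        A_new = weighted_sst(np.where(mask, obs, A_old), w, lam)
        denom = max(np.linalg.norm(A_old), 1e-12)  # A_old may be the zero matrix
        if np.linalg.norm(A_new - A_old) / denom <= tol:
            break
    return A_new

def singular_weights(A, floor=1e-12):
    """Weights w_j = sigma_j(A); the floor (an assumption) avoids division by zero."""
    return np.maximum(np.linalg.svd(A, compute_uv=False), floor)

def wsst(obs, mask, A_prelim, lam_1, lam_target, q=0.7, tol=1e-3, K=5):
    w = singular_weights(A_prelim)
    A, lam = np.zeros_like(obs), lam_1
    while True:  # continuation loop, ending exactly at lam_target
        A = solve_fixed_point(A, obs, mask, w, lam, tol)
        if lam <= lam_target:
            break
        lam = max(lam * q, lam_target)
    for _ in range(K):  # reweighting loop at fixed lam
        w = singular_weights(A)
        A = solve_fixed_point(A, obs, mask, w, lam, tol)
    return A

# Small synthetic example (hypothetical sizes, much smaller than in the experiments).
rng = np.random.default_rng(0)
n, r = 30, 2
A0 = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
mask = rng.random((n, n)) < 0.6
obs = np.where(mask, A0, 0.0)
top = np.linalg.norm(obs, 2)  # largest singular value of the observed matrix
A_hat = wsst(obs, mask, A_prelim=obs, lam_1=0.5 * top, lam_target=1e-3 * top)
print("relative error:", np.linalg.norm(A_hat - A0) / np.linalg.norm(A0))
```

Singular values that are thresholded to zero receive a (floored) tiny weight at the next update and therefore stay at zero, which is one way to see why the reweighting step can only keep or decrease the rank in this sketch.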

Remark 2. A good point with WSST is that it can handle large-scale matrices,
since at each iteration one only needs to store $A_{\mathrm{old}}$, which is
a low-rank matrix (coming out of a previous spectral soft-thresholding), and
$\mathcal{P}_\Omega(-A_{\mathrm{old}} + A_0)$, which is a sparse matrix.

Remark 3. The overall computational cost of WSST is obviously much higher
than that of NNM, since we use $K$ iterations, and since we don't use
accelerated gradient, linesearch and other accelerating recipes in our
implementation of WSST. This is done on purpose: we want to compare the
quality of reconstruction of the "pure" WSST, without the help of
computational tricks, which usually improve not only the rate of convergence
but the accuracy of reconstruction as well (this is the case if one compares
NNM with and without these tools).



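Algorithm 1: Computation of the iteratively weighted spectral
soft-thresholding.

Input: The observed entries $\mathcal{P}_\Omega(A_0)$, a preliminary
  reconstruction $\hat{A}^{(0)}$ and parameters
  $\lambda_1 > \lambda_{\mathrm{target}} > 0$, $0 < q, \mathrm{tol} < 1$, $K \ge 1$
Output: The WSST reconstruction $\hat{A}^{(K)}$

Put $A_{\mathrm{new}} = 0$, $\lambda = \lambda_1$ and take $w_j = \sigma_j(\hat{A}^{(0)})$
while $\lambda > \lambda_{\mathrm{target}}$ do
    Put $\delta = +\infty$
    while $\delta > \mathrm{tol}$ do
        $A_{\mathrm{old}} \leftarrow A_{\mathrm{new}}$
        $A_{\mathrm{new}} \leftarrow S^w_\lambda\big(A_{\mathrm{old}} - \mathcal{P}_\Omega(A_{\mathrm{old}}) + \mathcal{P}_\Omega(A_0)\big)$
        $\delta \leftarrow \|A_{\mathrm{new}} - A_{\mathrm{old}}\|_2 / \|A_{\mathrm{old}}\|_2$
    end
    $\lambda = \lambda \times q$
end
Put $\hat{A}^{(1)} = A_{\mathrm{new}}$
for $k = 1, \dots, K$ do
    Put $w_j = \sigma_j(\hat{A}^{(k)})$ and $\delta = +\infty$
    while $\delta > \mathrm{tol}$ do
        $A_{\mathrm{old}} \leftarrow A_{\mathrm{new}}$
        $A_{\mathrm{new}} \leftarrow S^w_\lambda\big(A_{\mathrm{old}} - \mathcal{P}_\Omega(A_{\mathrm{old}}) + \mathcal{P}_\Omega(A_0)\big)$
        $\delta \leftarrow \|A_{\mathrm{new}} - A_{\mathrm{old}}\|_2 / \|A_{\mathrm{old}}\|_2$
    end
end
return $\hat{A}^{(K)}$

3.2.2 Phase transition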

In Figure 3, we give a first empirical evidence of the fact that WSST
improves a lot upon NNM. For each $r \in \{5, 10, 15, \dots, 80\}$, we repeat
the following experiment 50 times. We draw at random $U$ and $V$ as
$500 \times r$ matrices with i.i.d. $N(0, 1)$ entries, and put
$A_0 = UV^\top$ (which has rank $r$ a.s.). Then, we choose uniformly at
random 30% of the entries of $A_0$, and compute the NNM and the WSST based on
these observed entries. In Figure 3 we show, for each $r$ ($x$-axis), the
boxplots of the relative reconstruction errors
$\|\hat{A} - A_0\|_2 / \|A_0\|_2$ over the 50 repetitions for $\hat{A}$ = NNM
(top-left) and $\hat{A}$ = WSST (top-right). On this example, we observe that
NNM is not able to recover matrices with a rank larger than 35, while WSST
can recover matrices with a rank up to 70. The boxplots of the ranks
recovered by NNM and WSST are on the second line, where we observe that WSST
always recovers the true rank up to a rank of order 70, while NNM correctly
recovers the rank (and only most of the time) up to rank 35, and
overestimates it a lot for larger ranks. So, on this simulated example, we
observe a serious improvement over NNM using WSST, since the latter has the
exact reconstruction property for matrices with a rank twice as large (70
instead of 35).
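
One repetition of this simulation setup can be sketched as below. The helper names `draw_problem` and `relative_error` are hypothetical, and the sketch covers only the data generation and the error measure (the norm is taken to be the Frobenius norm); the NNM and WSST solvers themselves are not included.

```python
import numpy as np

def draw_problem(n=500, r=10, frac=0.3, seed=0):
    """Draw A0 = U V^T with Gaussian factors and a uniform mask of observed entries."""
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((n, r))
    V = rng.standard_normal((n, r))
    A0 = U @ V.T
    # Observe exactly round(frac * n^2) entries, chosen uniformly without replacement.
    m = int(round(frac * n * n))
    idx = rng.choice(n * n, size=m, replace=False)
    mask = np.zeros(n * n, dtype=bool)
    mask[idx] = True
    return A0, mask.reshape(n, n)

def relative_error(A_hat, A0):
    """Relative reconstruction error ||A_hat - A0||_2 / ||A0||_2 (Frobenius norms)."""
    return np.linalg.norm(A_hat - A0) / np.linalg.norm(A0)

A0, mask = draw_problem(n=50, r=5)
print(np.linalg.matrix_rank(A0), mask.mean())
```

Looping `r` over $\{5, 10, \dots, 80\}$ and `seed` over 50 values would reproduce the design of the experiment, with any completion routine plugged in between the two helpers.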

3.2.3 Image inpainting 

In Figure 4 we consider the reconstruction of four test images ("lenna",
"fingerprint", "flinstones" and "boat"). Each test image has $512 \times 512$
pixels, and is of rank 50. We only observe 30% of the pixels, picked
uniformly at random, with no noise. The observations are given in the first
line of Figure 4, where non-observed pixels are represented in white. The
second line gives the reconstruction obtained using NNM. The third line shows
the difference between the true image and the recovery by NNM, where blue is
perfect recovery and red is bad recovery. The fourth line shows the
reconstruction using WSST and the fifth shows the difference between the true
image and the recovery by WSST.

On all four images, the recovery is much better using WSST, in particular on
the fingerprint and flinstones images. This can be understood from the fact
that these two are very structured images. The most surprising fact is that
all four reconstructions using NNM have rank 150 (because of the way we
choose $\lambda$, see above), while the rank of the reconstructions obtained
with WSST is never more than 90 (with the same choice of $\lambda$). So, WSST
leads to simpler (with a lower rank, which is better in terms of
compression/description) and more accurate reconstructions. In particular, we
observe that WSST recovers the underlying geometry of the true images more
precisely (for instance, on the third line, first column, we can recognize
the shape of lenna in the NNM error, while this is not the case in the WSST
error).

Figure 3: Boxplots of the recovery errors (first line) and recovered ranks
(second line) using NNM (left) and WSST (right) of a $500 \times 500$ rank
$r$ matrix with $r$ between 5 and 80 ($x$-axis).

3.2.4 Collaborative filtering 

Now, we consider matrix completion for a real dataset: the MovieLens data.
It contains 3 datasets, available on http://www.grouplens.org/:

• movie-100K: 100,000 ratings for 1682 movies by 943 users

• movie-1M: 1 million ratings for 3900 movies by 6040 users

• movie-10M: 10 million ratings and 100,000 tags for 10681 movies by
71567 users

The ratings of the users are integers between 1 and 5. In each of the 3
datasets, each user has rated at least 20 movies. For our experiments, we
simply choose uniformly at random half of the ratings of each user to form a
subset $\Gamma$ of the entire set $\Omega$ of ratings. Then, based on the
ratings in $\Gamma$, we try to predict the ratings in $\Omega - \Gamma$.
Since many entries are missing, we measure the accuracy of completion by
computing the relative error on $\Omega - \Gamma$. If $\hat{A}$ is a
reconstruction matrix, we reproduce in Table 1 below the values of

$\mathrm{err} = \|\mathcal{P}_{\Omega - \Gamma}(\hat{A}) - \mathcal{P}_{\Omega - \Gamma}(A_0)\|_2 / \|\mathcal{P}_{\Omega - \Gamma}(A_0)\|_2$,   (3.14)

together with the rank used for the reconstruction. We observe in Table 1
that WSST improves a lot upon NNM on each dataset. The most surprising fact
is that the rank used by WSST is much smaller than the one used by NNM, while
leading at the same time to strong prediction improvements. For movie-1M for
instance, the prediction error of WSST is 30% better than that of NNM, while
the NNM solution has rank 200 and the WSST solution has rank 40. Once again,
we can conclude on this example that WSST gives both much simpler
reconstructions and better prediction accuracy. Note that we considered a
maximum rank equal to 200 for the movie-100K and movie-1M datasets, and equal
to 50 for movie-10M (to make this problem computationally tractable on a
normal computer).

Figure 4: Image reconstruction using NNM and WSST. First line: observed
pixels (white means non-observed). Second line: reconstruction using NNM.
Third line: difference between truth and NNM (red is bad, blue is good).
Fourth line: recovery using WSST. Fifth line: difference between truth and
WSST.

                                         relative error         rank
              n1/n2          m           NNM        WSST        NNM    WSST
movie-100K    943/1682       1.00e+5     3.92e-01   3.30e-01    128    33
movie-1M      6040/3702      1.00e+6     3.83e-01   2.70e-01    200    40
movie-10M     71567/10674    9.91e+6     2.76e-01   2.36e-01    50     5

Table 1: Relative reconstruction errors for the MovieLens datasets.
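
The train/test split and the relative error (3.14) used for this table can be sketched as follows. The function names and the tiny rating matrix are hypothetical and only serve to illustrate the computation; rows are taken to be users, and `omega` and `gamma` are boolean masks for $\Omega$ and $\Gamma$.

```python
import numpy as np

def split_half_per_user(omega, rng):
    """Pick, for each user (row), half of the observed ratings to form Gamma."""
    gamma = np.zeros_like(omega, dtype=bool)
    for u in range(omega.shape[0]):
        rated = np.flatnonzero(omega[u])
        keep = rng.choice(rated, size=len(rated) // 2, replace=False)
        gamma[u, keep] = True
    return gamma

def heldout_relative_error(A_hat, A0, omega, gamma):
    """Relative error (3.14), computed on the held-out set Omega - Gamma."""
    test = omega & ~gamma
    return np.linalg.norm((A_hat - A0)[test]) / np.linalg.norm(A0[test])

# Toy example: 0 encodes a missing rating (a convention of this sketch only).
A0 = np.array([[5.0, 3.0, 0.0],
               [4.0, 0.0, 1.0]])
omega = A0 > 0
gamma = np.array([[True, False, False],
                  [True, False, False]])
A_hat = np.full_like(A0, 4.0)  # a constant prediction, for illustration
err = heldout_relative_error(A_hat, A0, omega, gamma)
print(err)
```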

4 Proofs 

4.1 Proofs for Section 2

We denote by $\ell_p^N$ the space $\mathbb{R}^N$ endowed with the $\ell_p$
norm. The unit ball there is denoted by $B_p^N$. We also denote the unit
Euclidean sphere in $\mathbb{R}^N$ by $S^{N-1}$. We denote by
$(e_1, \dots, e_N)$ the canonical basis of $\mathbb{R}^N$ and for any
$I \subset \{1, \dots, N\}$ denote by $\mathbb{R}^I$ the subspace of
$\mathbb{R}^N$ spanned by $\{e_i : i \in I\}$. Let
$A = [A_{\{1\}}, \dots, A_{\{N\}}]$ be a matrix from $\mathbb{R}^N$ to
$\mathbb{R}^m$, where $A_{\{i\}}$ denotes the $i$-th column vector of $A$.
Let $x \in \mathbb{R}^N$ and $I$ an arbitrary subset of $\{1, \dots, N\}$.
We define $A_I = [A_{\{i\}} : i \in I]$ the matrix from $\mathbb{R}^I$ to
$\mathbb{R}^m$ with column vectors $A_{\{i\}}$ for $i \in I$. We denote by
$x_I$ the vector in $\mathbb{R}^I$ with coordinates $x_i$ for $i \in I$,
where $x_i$ is the $i$-th coordinate of $x$. We denote by $x^I$ the vector of
$\mathbb{R}^N$ such that $x^I_i = 0$ when $i \notin I$ and $x^I_i = x_i$ when
$i \in I$. If $w \in \mathbb{R}^N$ has non-negative coordinates, we denote by
$wx$ the vector $(w_1 x_1, \dots, w_N x_N)$ and by $x/w$ the vector
$(x_1/w_1, \dots, x_N/w_N)$ with the previous convention in the case where
$w_i = 0$ for some $i$. We denote by $|x|$ the vector
$(|x_1|, \dots, |x_N|)$. The support of $x$ is denoted by $I_x$; this is the
set of all $i \in \{1, \dots, N\}$ such that $x_i \neq 0$. We also consider
the $w$-weighted $\ell_1$-norm

$|x|_{1,w} = \sum_{i=1}^N \frac{|x_i|}{w_i}$.   (4.1)

Note that $|\cdot|_{1,w}$ is a norm only when restricted to
$\mathbb{R}^{I_w}$, where $I_w$ is the support of $w$.

We start with the well-known null space property and dual characterization
[6] of exact reconstruction of a vector by $\ell_1$-based algorithms.

Proposition 1. Let $x, w \in \mathbb{R}^N$ and denote by $I_x$ (resp. $I_w$)
the support of $x$ (resp. $w$). The following points are equivalent:

1. $\Delta_w(Ax) = x$,

2. $I_x \subset I_w$ and for any $h \in \ker A_{I_w}$ such that $h \neq 0$
then

$\left|\left(\frac{h}{w}\right)_{I_x^c}\right|_1 + \left\langle \mathrm{sgn}(x_{I_x}), \left(\frac{h}{w}\right)_{I_x} \right\rangle > 0$,

3. $I_x \subset I_w$ and there exists $Y \in (\ker A_{I_w})^\perp$ such that
$(w_{I_w} Y)_{I_x} = \mathrm{sgn}(x_{I_x})$ and
$|(w_{I_w} Y)_{I_x^c}|_\infty < 1$.


Proof. It follows from (2.3) that, under each one of the three conditions, we
have $I_x \subset I_w$. Therefore, to simplify notation, we can work as if
the ambient space were $\mathbb{R}^{I_w}$. Hence, without loss of generality,
we assume that $\mathbb{R}^{I_w} = \mathbb{R}^N$. We also denote by $I = I_x$
the support of $x$.

[Point 2. entails Point 1.] Using standard arguments (see for instance [Sj),
we can see that the subgradient of $|\cdot|_{1,w}$ at $x \in \mathbb{R}^N$ is
the set

$\partial |x|_{1,w} = \{ t \in \mathbb{R}^N : t_i = \mathrm{sgn}(x_i)/w_i \text{ when } x_i \neq 0 \text{ and } |t_i| \le 1/w_i \text{ when } x_i = 0 \}$.   (4.2)

Using the definition of the subgradient of $|\cdot|_{1,w}$ at $x$, it follows
that for any $h \in \mathbb{R}^N$,

$|x + h|_{1,w} \ge |x|_{1,w} + |(h/w)_{I^c}|_1 + \langle \mathrm{sgn}(x_I), (h/w)_I \rangle$.

Thus, if Point 2 holds then for any $h \in \ker A$ such that $h \neq 0$,

$|x + h|_{1,w} > |x|_{1,w}$

and thus Point 1 is satisfied.

[Point 3. entails Point 2.] Let $Y \in (\ker A)^\perp$ be such that
$(wY)_I = \mathrm{sgn}(x_I)$ and $|(wY)_{I^c}|_\infty < 1$. For any
$h \neq 0$ in $\ker A$, we have

$|(h/w)_{I^c}|_1 + \langle \mathrm{sgn}(x_I), (h/w)_I \rangle = \langle \mathrm{sgn}(x)^I + \mathrm{sgn}(h)^{I^c}, h/w \rangle$
$\quad = \langle (\mathrm{sgn}(x)/w)^I + (\mathrm{sgn}(h)/w)^{I^c}, h \rangle$
$\quad = \langle (\mathrm{sgn}(x)/w)^I + (\mathrm{sgn}(h)/w)^{I^c} - Y, h \rangle$
$\quad = \langle (\mathrm{sgn}(h)/w)_{I^c} - Y_{I^c}, h_{I^c} \rangle = \sum_{i \in I^c} \frac{h_i}{w_i} (\mathrm{sgn}(h_i) - w_i Y_i) > 0$,

where we used $\langle Y, h \rangle = 0$ in the third equality and Point 3 in
the last two steps.

[Point 1. entails Point 3.] This follows from classical results on the
minimization of a convex function over a convex set (cf. [33]). Nevertheless,
we provide a direct proof following the argument of [6]. Denote by
$\{e_1, \dots, e_N\}$ the canonical basis in $\mathbb{R}^N$ and by
$B_{1,w}^N$ the unit ball associated to the $w$-weighted $\ell_1$-norm:

$B_{1,w}^N = \{ t \in \mathbb{R}^N : |t|_{1,w} \le 1 \}$.   (4.3)

If $x$ is the unique solution of (2.2) then
$|x|_{1,w} B_{1,w}^N \cap (x + \ker A) = \{x\}$. Then by a duality argument
(for instance the Hahn-Banach theorem for the separation of convex sets),
there exists $Y \in \mathbb{R}^N$ such that $x + \ker A \subset P_1$, where
$P_1 = \{ t : \langle t, Y \rangle = 1 \}$, and
$|x|_{1,w} B_{1,w}^N \subset P_{\le 1}$, where
$P_{\le 1} = \{ t : \langle t, Y \rangle \le 1 \}$. Introduce
$F_{1,w}(x) = |x|_{1,w} \, \mathrm{conv}(w_i e_i : x_i \neq 0)$, the face of
$|x|_{1,w} B_{1,w}^N$ containing $x$. By moving the hyperplane $P_1$, we can
assume that $|x|_{1,w} B_{1,w}^N \cap P_1 \subset F_{1,w}(x)$. Since
$|x|_{1,w} B_{1,w}^N \subset P_{\le 1}$, we have
$\sup_{|t|_{1,w} \le |x|_{1,w}} \langle t, Y \rangle \le 1$, thus
$|wY|_\infty \le 1/|x|_{1,w}$. Moreover, $x \in P_1$ so
$1 = \langle x, Y \rangle \le |x|_{1,w} |wY|_\infty \le 1$ because
$|wY|_\infty \le 1/|x|_{1,w}$. This is the equality case in Hölder's
inequality, so it follows that $(wY)_I = \mathrm{sgn}(x_I)/|x|_{1,w}$. Then,
for any $i \notin I$, $|x|_{1,w} w_i e_i \in |x|_{1,w} B_{1,w}^N$, thus
$\langle |x|_{1,w} w_i e_i, Y \rangle \le 1$, and
$|x|_{1,w} w_i e_i \notin F_{1,w}(x)$, so $|x|_{1,w} w_i e_i \notin P_1$,
thus $\langle |x|_{1,w} w_i e_i, Y \rangle < 1$. That is,
$|(wY)_{I^c}|_\infty < 1/|x|_{1,w}$. Finally, for any $h \in \ker A$,
$1 = \langle x + h, Y \rangle = \langle x, Y \rangle + \langle h, Y \rangle = 1 + \langle h, Y \rangle$,
thus $\langle h, Y \rangle = 0$ and $Y \in (\ker A)^\perp$. Then, we
normalize $Y$ by $|x|_{1,w}$ to obtain Point 3. □

Both Criteria 2 and 3 in Proposition 1 can be used to characterize the exact
reconstruction of a vector $x$ by the $\ell_1$-weighted algorithm. The vector
$Y$ of Criterion 3 is now called an exact dual certificate (cf. [6, 26]). We
will use Criterion 3 and the construction of an exact dual certificate from
[6] to prove Theorems 1 and 2. Note that Criterion 2 together with the
construction of an inexact dual certificate (cf. [26]) can also be used.
Nevertheless, we do not present this construction here since it does not
improve the statement of Theorem 2.

4.1.1 Proof of Theorem 1

In the same way as we did in the proof of Proposition 1, we can work as if
the ambient space were $\mathbb{R}^{I_w}$ and assume, without loss of
generality, that $\mathbb{R}^{I_w} = \mathbb{R}^N$. We denote by $I$ the
support of $x$. We prove first that when $\Delta_1(Ax) = x$, then $A_I$ is
injective. Indeed, suppose that there exists some $h \in \mathbb{R}^I$ such
that $h \neq 0$ and $A_I h = 0$. Denote by $h^0 \in \mathbb{R}^N$ the vector
such that $h^0_I = h$ and $h^0_{I^c} = 0$. We have $h^0 \neq 0$ and
$A h^0 = A_I h = 0$. In particular, for any $\lambda \neq 0$,
$\lambda h^0 \in \ker A - \{0\}$. Therefore, since $x$ is the unique solution
of the Basis Pursuit algorithm, it follows from Point 2 of Proposition 1
(applied to the weight vector $w = (1, \dots, 1)$) that, for every
$\lambda \neq 0$, $\langle \mathrm{sgn}(x_I), \lambda h^0_I \rangle > 0$.
This is not possible (replace $\lambda$ by $-\lambda$), so $A_I$ is
injective.

Since $\Delta_1(Ax) = x$, the decoder $\Delta_2$ is given here by

$\Delta_2(Ax) \in \mathrm{argmin} \left( \sum_{i=1}^N \frac{|t_i|}{|x_i|} : At = Ax \right)$.

Therefore, according to (2.3), we have $\Delta_2(Ax)_i = 0$ for any
$i \notin I$, that is $\mathrm{supp}(\Delta_2(Ax)) \subset I$. As a
consequence, $A_I x_I = Ax = A \Delta_2(Ax) = A_I \Delta_2(Ax)_I$, and $A_I$
is injective, thus $x_I = \Delta_2(Ax)_I$. Since
$x_{I^c} = 0 = \Delta_2(Ax)_{I^c}$, we have $x = \Delta_2(Ax)$.

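4.1.2 Proof of Theorem 2

We adapt to our setup the "dual certificate" introduced in [6] and consider

$Y^0 = A^\top A_I (A_I^\top A_I)^{-1} \left( \frac{\mathrm{sgn}(x)}{w} \right)_I$.   (4.4)

In particular, we have $Y^0 \in \mathrm{im}(A^\top) = (\ker A)^\perp$ and

$Y^0_I = A_I^\top A_I (A_I^\top A_I)^{-1} \left( \frac{\mathrm{sgn}(x)}{w} \right)_I = \left( \frac{\mathrm{sgn}(x)}{w} \right)_I$.

Thus, we have $(wY^0)_I = \mathrm{sgn}(x_I)$. In view of Proposition 1, it
only remains to prove that $|(wY^0)_{I^c}|_\infty < 1$ with high probability.
For $0 < \delta < 1$ and $\mu > 0$, we consider the events

$\Omega_0(I, \delta) = \{ (1 - \delta)|y|_2^2 \le |A_I y|_2^2 \le (1 + \delta)|y|_2^2, \ \forall y \in \mathbb{R}^I \}$   (4.5)

and

$\Omega_1(I, \mu) = \{ \max_{i \in I^c} |A_I^\top A_{\{i\}}|_2 \le \mu \}$.   (4.6)

First, note that since $A_I^\top A_I - \mathrm{Id}$ is Hermitian, we have

$\|A_I^\top A_I - \mathrm{Id}\|_{2 \to 2} = \sup_{|y|_2 = 1} \big| |A_I y|_2^2 - 1 \big|$.

Thus, on $\Omega_0(I, \delta)$, we have
$\|A_I^\top A_I - \mathrm{Id}\|_{2 \to 2} \le \delta$ and so for any
$y \in \mathbb{R}^I$, $|(A_I^\top A_I)^{-1} y|_2 \le (1 - \delta)^{-1} |y|_2$.
In particular,

$\left| (A_I^\top A_I)^{-1} \left( \frac{\mathrm{sgn}(x)}{w} \right)_I \right|_2 \le \frac{1}{1 - \delta} \left| \left( \frac{\mathrm{sgn}(x)}{w} \right)_I \right|_2 = \frac{1}{1 - \delta} \left| \left( \frac{1}{w} \right)_I \right|_2$.

Then, it follows that, on $\Omega_0(I, \delta) \cap \Omega_1(I, \mu)$ and
under condition $(A0)(I, (1 - \delta)/\mu)$,

$|(wY^0)_{I^c}|_\infty = \max_{i \in I^c} \left| w_i A_{\{i\}}^\top A_I (A_I^\top A_I)^{-1} \left( \frac{\mathrm{sgn}(x)}{w} \right)_I \right|$
$\quad \le \max_{i \in I^c} w_i \, \max_{i \in I^c} \left| \left\langle A_I^\top A_{\{i\}}, (A_I^\top A_I)^{-1} \left( \frac{\mathrm{sgn}(x)}{w} \right)_I \right\rangle \right|$
$\quad \le \max_{i \in I^c} w_i \, \max_{i \in I^c} |A_I^\top A_{\{i\}}|_2 \left| (A_I^\top A_I)^{-1} \left( \frac{\mathrm{sgn}(x)}{w} \right)_I \right|_2$
$\quad \le \frac{\mu}{1 - \delta} \max_{i \in I^c} w_i \left| \left( \frac{1}{w} \right)_I \right|_2 < 1$.

Then, Theorem 2 follows from the probability estimates of
$\Omega_0(I, \delta) \cap \Omega_1(I, \mu)$ provided in the next lemma.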
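
The certificate (4.4) is explicit, so its two properties can be checked numerically. The sketch below builds $Y^0$ for a Gaussian matrix and prints the quantity $|(wY^0)_{I^c}|_\infty$ that the proof bounds. The dimensions, the random seed and the weight vector (1 on the support, 0.5 elsewhere) are hypothetical choices made for the example, and the function name `dual_certificate` is not from the paper.

```python
import numpy as np

def dual_certificate(A, x, w):
    """Y0 = A^T A_I (A_I^T A_I)^{-1} (sgn(x)/w)_I, with I the support of x."""
    I = np.flatnonzero(x)
    A_I = A[:, I]
    z = np.linalg.solve(A_I.T @ A_I, np.sign(x[I]) / w[I])
    return A.T @ (A_I @ z), I

rng = np.random.default_rng(0)
m, N, s = 60, 200, 5
A = rng.standard_normal((m, N)) / np.sqrt(m)   # A = m^{-1/2} (g_ij)
x = np.zeros(N)
x[rng.choice(N, size=s, replace=False)] = rng.standard_normal(s)
w = np.full(N, 0.5)                # hypothetical weights: smaller off the support
w[np.flatnonzero(x)] = 1.0
Y0, I = dual_certificate(A, x, w)
off = np.setdiff1d(np.arange(N), I)
print("on-support match:", np.allclose((w * Y0)[I], np.sign(x[I])))
print("max off-support |w Y0|:", np.max(np.abs((w * Y0)[off])))
```

The first printed property holds by construction; the second value is the one that must be strictly below 1 for Criterion 3 of Proposition 1 to apply, and it scales with the off-support weights, as the last line of the chain of inequalities indicates.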

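Lemma 4.1. Let $A = m^{-1/2}(g_{ij})$ be an $m \times N$ matrix where the
$g_{ij}$'s are i.i.d. standard Gaussian variables. Assume that

$m \ge c_0 \max \left( \frac{s}{\delta^2}, \frac{s \log N}{\mu^2} \right)$.

With probability larger than
$1 - 2\exp(-c_1 m \delta^2) - \exp(-c_2 \mu^2 m / s)$, we have

$(1 - \delta)|y|_2^2 \le |A_I y|_2^2 \le (1 + \delta)|y|_2^2, \quad \forall y \in \mathbb{R}^I$,

and $\max_{i \in I^c} |A_I^\top A_{\{i\}}|_2 \le \mu$.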
Proof. For the sake of completeness, we recall here the classical
$\varepsilon$-net argument to prove the first statement of Lemma 4.1. It is
enough to prove that $\sup_{y \in S^I} \big| |A_I y|_2^2 - 1 \big| \le \delta$,
where $S^I$ is the set of unit vectors of $\ell_2^N$ supported on $I$. First,
note that

$\sup_{y \in S^I} \big| |A_I y|_2^2 - 1 \big| = \sup_{y \in S^I} |\langle Ty, y \rangle| = \|T\|_{2 \to 2}$,

where $T : \mathbb{R}^I \to \mathbb{R}^I$ is the symmetric operator
$A_I^\top A_I - \mathrm{Id}$. Let $\Lambda \subset S^I$ be a 1/4-net of $S^I$
for the $\ell_2$ metric with a cardinality smaller than $9^s$ (the existence
of such a net follows from a volumetric argument, see [32]). For any
$y \in S^I$, there exists $z \in \Lambda$ such that $y = z + u$ with
$|u|_2 \le 1/4$ and therefore,

$|\langle Ty, y \rangle| \le |\langle Tz, z \rangle| + |\langle Tu, u \rangle| + 2|\langle Tz, u \rangle| \le \max_{z \in \Lambda} |\langle Tz, z \rangle| + \frac{9}{16} \|T\|_{2 \to 2}$.

Hence, $\|T\|_{2 \to 2} \le (16/7) \max_{z \in \Lambda} |\langle Tz, z \rangle|$,
and it is enough to control the supremum of
$y \mapsto |\langle Ty, y \rangle|$ over $\Lambda$ instead of $S^I$.

Let $y \in \Lambda$. We denote by $G_1/\sqrt{m}, \dots, G_m/\sqrt{m}$ the row
vectors of $A_I$, where $G_1, \dots, G_m$ are $m$ independent standard
Gaussian vectors of $\mathbb{R}^I$. We have
$\langle Ty, y \rangle = m^{-1} \sum_{i=1}^m \langle G_i, y \rangle^2 - 1$.
Since $\|\langle G, y \rangle^2\|_{\psi_1} = \|\langle G, y \rangle\|_{\psi_2}^2$,
it follows from Bernstein's inequality for $\psi_1$ random variables [50]
that

$P[\,|\langle Ty, y \rangle| \le \delta\,] \ge 1 - 2\exp(-c_1 m \delta^2)$,

and a union bound yields

$P[\,|\langle Ty, y \rangle| \le \delta, \ \forall y \in \Lambda\,] \ge 1 - 2\exp(s \log 9 - c_1 m \delta^2)$.

Combining the $\varepsilon$-net argument with this probability estimate, we
obtain that when $m \ge c_2 s / \delta^2$ then $\|T\|_{2 \to 2} \le \delta$
with probability at least $1 - 2\exp(-c_3 m \delta^2)$.

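Now, we turn to the second part of the statement. Let $i \in I^c$. The $i$-th
column vector of $A$ is
$A_{\{i\}} = G_i/\sqrt{m} = (g_{i1}, \dots, g_{im})^\top / \sqrt{m}$, where
the $G_i$'s are independent standard Gaussian vectors of $\mathbb{R}^m$. Let
$q \ge 2$ to be chosen later. By Markov's inequality,

$P\big[\,|A_I^\top A_{\{i\}}|_2 \ge \mu\,\big] = P\Big[\,\Big|\sum_{j=1}^m g_{ij} G_{jI}\Big|_2 \ge m\mu\,\Big] \le (m\mu)^{-q} \, E\Big|\sum_{j=1}^m g_{ij} G_{jI}\Big|_2^q$.   (4.7)

Now, we use the vectorial version of Khintchine's inequality conditionally on
$G_{1I}, \dots, G_{mI}$, to obtain, for some absolute constant $c_4$,

$\Big( E_g \Big|\sum_{j=1}^m g_{ij} G_{jI}\Big|_2^q \Big)^{1/q} \le c_4 \sqrt{q} \, \Big( E_g \Big|\sum_{j=1}^m g_{ij} G_{jI}\Big|_2^2 \Big)^{1/2} = c_4 \sqrt{q} \, \Big( \sum_{j=1}^m |G_{jI}|_2^2 \Big)^{1/2}$.

It follows that

$E\Big|\sum_{j=1}^m g_{ij} G_{jI}\Big|_2^q \le (c_5^2 \, q \, m \, s)^{q/2}$.

Hence, in (4.7) for $q = (\mu^2/(2c_5^2))(m/s)$, we obtain

$P\big[\,|A_I^\top A_{\{i\}}|_2 \ge \mu\,\big] \le \exp\Big( -\frac{\mu^2 m \log 2}{4 c_5^2 s} \Big)$.

The result now follows from a union bound. □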
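
For a given draw of $A$ and a given support $I$, the smallest $\delta$ and $\mu$ for which the two events of Lemma 4.1 hold can be computed directly, which gives a quick empirical feel for the orders $\sqrt{s/m}$ appearing in the lemma. The sketch below is only an illustration: the function name and the dimensions are hypothetical.

```python
import numpy as np

def event_constants(A, I):
    """Smallest delta and mu such that the events (4.5) and (4.6) hold for this A and I."""
    A_I = A[:, I]
    # sup over unit y of | |A_I y|^2 - 1 | is the spectral norm of A_I^T A_I - Id
    delta_hat = np.linalg.norm(A_I.T @ A_I - np.eye(len(I)), 2)
    off = np.setdiff1d(np.arange(A.shape[1]), I)
    # column norms of A_I^T A_{off}: one value |A_I^T A_{i}|_2 per i outside I
    mu_hat = np.max(np.linalg.norm(A_I.T @ A[:, off], axis=0))
    return delta_hat, mu_hat

rng = np.random.default_rng(1)
m, N, s = 200, 400, 10
A = rng.standard_normal((m, N)) / np.sqrt(m)
I = np.sort(rng.choice(N, size=s, replace=False))
delta_hat, mu_hat = event_constants(A, I)
print("delta_hat:", delta_hat, "mu_hat:", mu_hat, "sqrt(s/m):", np.sqrt(s / m))
```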

4.1.3 Proof of Theorem 3

Proof. Assume that $\Delta_r(Ax) = x$ and define $y = \Delta_{r+1}(Ax)$. By
construction of $y$, we have $\mathrm{supp}(y) \subset \mathrm{supp}(x)$ and
$Ax = Ay$. So, since $A$ is injective on $\Sigma_m$ and
$x - y \in \Sigma_m$, we have $x = y$. This proves that
$\Delta_{r+1}(Ax) = x$, and that the sequence $(\Delta_n(Ax))_n$ is constant
and equal to a $\lfloor m/2 \rfloor$-sparse vector starting from the $r$-th
iteration.

Now, assume that there exists an integer $r$ and
$y \in \Sigma_{\lfloor m/2 \rfloor}$ such that
$\Delta_r(Ax) = \Delta_{r+1}(Ax) = \cdots = y$. In particular, we have
$Ay = Ax$, so since $A$ is injective on $\Sigma_m$ and $x - y \in \Sigma_m$,
we have $x = y$. □

4.2 Proofs for Section 3

The next proposition shows that weighted spectral soft-thresholding achieves 
the minimum of the weighted nuclear norm plus a proximity term. Note that, 
however, weighted spectral soft-thresholding is not a proximal operator, since 
the weighted nuclear norm is not convex. This entails in particular that the 
proofs below use a direct analysis, since we cannot use arguments based on 
subdifferential computations here. 

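Proposition 2. Let $B \in \mathbb{R}^{n_1 \times n_2}$, $\tau, \lambda > 0$
and $w_1 \ge \cdots \ge w_{n_1 \wedge n_2} > 0$. Then the minimization problem

$\min_A \Big\{ \frac{1}{2} \|A - B\|_2^2 + \lambda \sum_{j=1}^{n_1 \wedge n_2} \frac{\sigma_j(A)}{w_j} + \frac{\tau}{2} \|A\|_2^2 \Big\}$

has a unique solution, given by $\frac{1}{1+\tau} S^w_\lambda(B)$, where
$S^w_\lambda(B)$ is the weighted soft-thresholding operator (3.12).

Proof of Proposition 2. Denote for short $q = n_1 \wedge n_2$ and write the
SVD of $A$ as $A = U \Sigma V^\top = \sum_{j=1}^q \sigma_j u_j v_j^\top$,
where $U = [u_1, \dots, u_q]$, $V = [v_1, \dots, v_q]$ and
$\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_q)$. We have

$\|A - B\|_2^2 + \tau \|A\|_2^2 = \|B\|_2^2 - 2 \sum_{j=1}^q \sigma_j u_j^\top B v_j + (1 + \tau) \sum_{j=1}^q \sigma_j^2$,

so that we want to minimize the function

$\phi(U, V, \Sigma) = \frac{1}{2} \sum_{j=1}^q \big( -2 \sigma_j u_j^\top B v_j + (1 + \tau) \sigma_j^2 \big) + \lambda \sum_{j=1}^q \frac{\sigma_j}{w_j}$

over $U, V, \Sigma$ with the constraints $U^\top U = I$, $V^\top V = I$ and
$\sigma_1 \ge \cdots \ge \sigma_q \ge 0$. Using the variational
characterization of singular values, if $B = U' \Sigma' V'^\top$ is the SVD
of $B$, where $U' = [u'_1, \dots, u'_q]$, $V' = [v'_1, \dots, v'_q]$ and
$\Sigma' = \mathrm{diag}(\sigma'_1, \dots, \sigma'_q)$, we know that the
maximum of $u^\top B v$ over all vectors $u$ and $v$ subject to
$|u|_2 = |v|_2 = 1$, $u$ orthogonal to $u'_1, \dots, u'_{j-1}$ and $v$
orthogonal to $v'_1, \dots, v'_{j-1}$ is achieved at $u'_j$ and $v'_j$, and
is equal to $\sigma'_j$. So the minimum of $\phi(U, V, \Sigma)$ over $U$ and
$V$ is achieved at $U = U'$ and $V = V'$, and

$\phi(U', V', \Sigma) = \frac{1}{2} \sum_{j=1}^q \Big( -2 \sigma_j \sigma'_j + (1 + \tau) \sigma_j^2 + \frac{2 \lambda \sigma_j}{w_j} \Big)$.

It is easy to see that for each $j$ the minimum over $\sigma_j$ is achieved
at $\sigma_j = \frac{1}{1+\tau} (\sigma'_j - \frac{\lambda}{w_j})_+$, which
is non-increasing in $j$. □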
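
The last step of this proof is a one-dimensional minimization, which can be checked numerically. The sketch below compares the closed form $\frac{1}{1+\tau}(\sigma'_j - \lambda/w_j)_+$ with a brute-force grid search on the $j$-th summand of $\phi(U', V', \Sigma)$. The function names, the grid and the parameter values are hypothetical and chosen only for illustration.

```python
def shrink(sigma_prime, lam, w_j, tau):
    """Closed-form minimizer: (sigma'_j - lam / w_j)_+ / (1 + tau)."""
    return max(sigma_prime - lam / w_j, 0.0) / (1.0 + tau)

def objective(sigma, sigma_prime, lam, w_j, tau):
    """j-th summand of phi(U', V', Sigma), as a function of sigma_j = sigma >= 0."""
    return -sigma * sigma_prime + 0.5 * (1.0 + tau) * sigma ** 2 + lam * sigma / w_j

def grid_argmin(sigma_prime, lam, w_j, tau, upper=5.0, steps=5000):
    """Brute-force minimizer over a regular grid of [0, upper] (step upper / steps)."""
    grid = [upper * k / steps for k in range(steps + 1)]
    return min(grid, key=lambda s: objective(s, sigma_prime, lam, w_j, tau))

cases = [(3.0, 1.0, 2.0, 0.5), (0.2, 1.0, 2.0, 0.5), (4.0, 2.0, 1.0, 0.0)]
for sigma_prime, lam, w_j, tau in cases:
    closed = shrink(sigma_prime, lam, w_j, tau)
    brute = grid_argmin(sigma_prime, lam, w_j, tau)
    print(closed, brute)
```

The second case illustrates the thresholding regime: when $\sigma'_j \le \lambda/w_j$ the summand is increasing on $[0, \infty)$ and the minimizer is 0.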

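As mentioned before, $S^w_\lambda$ is not a proximal operator. A nice
property of proximal operators is that they are firmly non-expansive, see
[H]. Namely, if $T$ is the proximal operator of some convex function over a
Hilbert space $H$, then we have

$\|Tx - Ty\|^2 \le \|x - y\|^2 - \|x - y - (Tx - Ty)\|^2$

for any $x, y \in H$. However, it turns out that we can prove that
$S^w_\lambda$ is non-expansive. Once again, the proof uses a direct and
technical analysis (since we cannot use arguments based on subdifferential
computations), while the firm non-expansivity of proximal operators is an
easy consequence of their definition.

Proposition 3. Let $w_1 \ge \cdots \ge w_{n_1 \wedge n_2} > 0$ and
$\lambda > 0$. Then, for any $A, B \in \mathbb{R}^{n_1 \times n_2}$, we have

$\|S^w_\lambda(A) - S^w_\lambda(B)\|_2 \le \|A - B\|_2$.

Proof of Proposition 3. Let us assume without loss of generality that
$\lambda = 1$. Write the SVD of $A$ and $B$ as $A = U_1 \Sigma_1 V_1^\top$
and $B = U_2 \Sigma_2 V_2^\top$, where
$\Sigma_1 = \mathrm{diag}[\sigma_{1,1}, \dots, \sigma_{1,r_1}]$,
$\Sigma_2 = \mathrm{diag}[\sigma_{2,1}, \dots, \sigma_{2,r_2}]$ and $r_1$
(resp. $r_2$) stands for the rank of $A$ (resp. $B$). We also write for short
$\tilde{A} = S^w_1(A) = U_1 \tilde{\Sigma}_1 V_1^\top$ and
$\tilde{B} = S^w_1(B) = U_2 \tilde{\Sigma}_2 V_2^\top$, where
$\tilde{\Sigma}_1 = \mathrm{diag}[(\sigma_{1,1} - 1/w_1)_+, \dots, (\sigma_{1,r_1} - 1/w_{r_1})_+]$
and
$\tilde{\Sigma}_2 = \mathrm{diag}[(\sigma_{2,1} - 1/w_1)_+, \dots, (\sigma_{2,r_2} - 1/w_{r_2})_+]$.
We want to prove that $\|A - B\|_2^2 - \|\tilde{A} - \tilde{B}\|_2^2 \ge 0$.
First use the decomposition

$\|A - B\|_2^2 - \|\tilde{A} - \tilde{B}\|_2^2 = \|A\|_2^2 - \|\tilde{A}\|_2^2 + \|B\|_2^2 - \|\tilde{B}\|_2^2 - 2\langle A, B \rangle + 2\langle \tilde{A}, \tilde{B} \rangle$
$\quad = \sum_{j=1}^{r_1} \sigma_{1,j}^2 - \sum_{j=1}^{\tilde{r}_1} \Big( \sigma_{1,j} - \frac{1}{w_j} \Big)^2 + \sum_{j=1}^{r_2} \sigma_{2,j}^2 - \sum_{j=1}^{\tilde{r}_2} \Big( \sigma_{2,j} - \frac{1}{w_j} \Big)^2 - 2\big( \langle A, B \rangle - \langle \tilde{A}, \tilde{B} \rangle \big)$,

where we take $\tilde{r}_1$ such that $\sigma_{1,j} > 1/w_j$ for
$j \le \tilde{r}_1$ and $\sigma_{1,j} \le 1/w_j$ for $j \ge \tilde{r}_1 + 1$,
and similarly for $\tilde{r}_2$. We decompose

$\langle A, B \rangle - \langle \tilde{A}, \tilde{B} \rangle = \langle A - \tilde{A}, B - \tilde{B} \rangle + \langle \tilde{A}, B - \tilde{B} \rangle + \langle A - \tilde{A}, \tilde{B} \rangle$.   (4.8)

Using von Neumann's trace inequality
$\langle X, Y \rangle \le \sum_j \sigma_j(X) \sigma_j(Y)$ (see for instance
[2H], Section 7.4.13), it follows for the first term of (4.8) that

$\langle A - \tilde{A}, B - \tilde{B} \rangle \le \sum_{j=1}^{r_1 \wedge r_2} (\Sigma_1 - \tilde{\Sigma}_1)_{j,j} (\Sigma_2 - \tilde{\Sigma}_2)_{j,j}$.

Using the same argument for the two other terms of (4.8), we obtain

$\langle A, B \rangle - \langle \tilde{A}, \tilde{B} \rangle \le \sum_{j=1}^{r_1 \wedge r_2} \Big( (\Sigma_1 - \tilde{\Sigma}_1)_{j,j} (\Sigma_2 - \tilde{\Sigma}_2)_{j,j} + (\tilde{\Sigma}_1)_{j,j} (\Sigma_2 - \tilde{\Sigma}_2)_{j,j} + (\Sigma_1 - \tilde{\Sigma}_1)_{j,j} (\tilde{\Sigma}_2)_{j,j} \Big)$.

We explore the case $r_1 \le r_2$ and $\tilde{r}_1 \le \tilde{r}_2$; the
other cases follow the same argument. We have

$\langle A, B \rangle - \langle \tilde{A}, \tilde{B} \rangle \le \sum_{j=1}^{\tilde{r}_1} \Big( \frac{\sigma_{1,j}}{w_j} + \Big( \sigma_{2,j} - \frac{1}{w_j} \Big) \frac{1}{w_j} \Big) + \sum_{j=\tilde{r}_1+1}^{r_1} \sigma_{1,j} \sigma_{2,j}$,

so, an easy computation leads to

$\|A - B\|_2^2 - \|\tilde{A} - \tilde{B}\|_2^2 \ge \sum_{j=\tilde{r}_2+1}^{r_1} \sigma_{1,j}^2 + \sum_{j=\tilde{r}_2+1}^{r_2} \sigma_{2,j}^2 - 2 \sum_{j=\tilde{r}_2+1}^{r_1} \sigma_{1,j} \sigma_{2,j} + \sum_{j=\tilde{r}_1+1}^{\tilde{r}_2} \Big( \sigma_{1,j}^2 - 2 \sigma_{1,j} \sigma_{2,j} + \frac{2 \sigma_{2,j}}{w_j} - \frac{1}{w_j^2} \Big)$.

We obviously have
$\sum_{j=\tilde{r}_2+1}^{r_1} \sigma_{1,j}^2 + \sum_{j=\tilde{r}_2+1}^{r_2} \sigma_{2,j}^2 - 2 \sum_{j=\tilde{r}_2+1}^{r_1} \sigma_{1,j} \sigma_{2,j} \ge 0$.
By definition of $\tilde{r}_2$ and $\tilde{r}_1$, we have
$\sigma_{1,j} \le 1/w_j < \sigma_{2,j}$ for any
$j = \tilde{r}_1 + 1, \dots, \tilde{r}_2$. Hence, we have

$\sigma_{1,j}^2 - 2 \sigma_{1,j} \sigma_{2,j} + \frac{2 \sigma_{2,j}}{w_j} - \frac{1}{w_j^2} = \Big( \sigma_{1,j} - 2 \sigma_{2,j} + \frac{1}{w_j} \Big) \Big( \sigma_{1,j} - \frac{1}{w_j} \Big) \ge 0$,

which concludes the proof of Proposition 3. □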



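Proof of Theorem 4. Consider the sequence $(A^k)_{k \ge 0}$ defined in
(3.13). Using Proposition 3 we have for any $k \ge 1$

$\|A^{k+1} - A^k\|_2 = \frac{1}{1+\tau} \big\| S^w_\lambda\big(\mathcal{P}_\Omega(A_0) + \mathcal{P}_\Omega^\perp(A^k)\big) - S^w_\lambda\big(\mathcal{P}_\Omega(A_0) + \mathcal{P}_\Omega^\perp(A^{k-1})\big) \big\|_2$
$\quad \le \frac{1}{1+\tau} \|\mathcal{P}_\Omega^\perp(A^k - A^{k-1})\|_2 \le \frac{1}{1+\tau} \|A^k - A^{k-1}\|_2$,

so that $\|A^{k+1} - A^k\|_2 \le (1 + \tau)^{-k} \|A^1 - A^0\|_2$. This
proves that $\sum_{k \ge 0} \|A^{k+1} - A^k\|_2 < +\infty$, so the limit of
$(A^k)_{k \ge 0}$ exists and is given by

$A^\infty = \sum_{k \ge 0} (A^{k+1} - A^k) + A^0$.

Now, by continuity of $S^w_\lambda$ and $\mathcal{P}_\Omega^\perp$, taking
the limit on both sides of (3.13), we obtain that $A^\infty$ satisfies the
fixed-point equation

$A^\infty = \frac{1}{1+\tau} S^w_\lambda\big(\mathcal{P}_\Omega^\perp(A^\infty) + \mathcal{P}_\Omega(A_0)\big)$,

so we have found at least one solution. Let us show now that it is unique, so
that the solution of the fixed-point problem equals $A^\infty$: consider a
matrix $B$ satisfying the same fixed-point equation. We have

$\|B - A^\infty\|_2 = \frac{1}{1+\tau} \big\| S^w_\lambda\big(\mathcal{P}_\Omega(A_0) + \mathcal{P}_\Omega^\perp(B)\big) - S^w_\lambda\big(\mathcal{P}_\Omega(A_0) + \mathcal{P}_\Omega^\perp(A^\infty)\big) \big\|_2 \le \frac{1}{1+\tau} \|B - A^\infty\|_2$,

therefore $B = A^\infty$. □

Proof of Theorem 5. We know from the proof of Theorem 4 that

$\|A^n - A^\infty\|_2 \le \sum_{k \ge n} \|A^{k+1} - A^k\|_2 \le \sum_{k \ge n} (1 + \tau)^{-k} \|A^1 - A^0\|_2$,

leading to the conclusion. □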